Tutorial: medical system (human HOTL oversight)
🟡 Partial — The medical.json scenario runs the full ex-ante evaluation (11 controls, GREEN gate). The worst-subgroup control (worst-subgroup-dice) is AUDIT: it fails on its point value but does NOT block the gate; the risk is treated with declared human HOTL oversight (Art.14). The evaluation does not run in hermetic CI (requires GPU + TCIA cache).
This tutorial walks through Venturalítica’s medical scenario: high-risk medical device software (SaMD) under the MDR (EU Reg. 2017/745) and EU AI Act Annex III §6. The vertebral segmentation engine is TotalSegmentator operating on the public TCIA Spine-Mets-CT-SEG cohort (≈55 real CT cases). The configuration fragments below are illustrative examples: they show how to write a sei.yaml for a medical-imaging system and are not a literal dump of any internal file.
What you will learn:
- How the engine manages a medical imaging system with segmentation metrics (Dice, es_dice) and equity metrics (score_gap, worst_cell_score)
- Why the gate can stay GREEN even with an unmet AUDIT control (the worst subgroup): an audit control records the residual but does not block
- How that residual is treated with human HOTL oversight (Art.14, the human-oversight-worst-case-review control), not with a parameter or code change
- How the MDR framework is applied via frameworks in the risk programme (the risk: section of sei.yaml)
- The contrast with tutoriales/primer-sistema (loan = code change): same engine, different regulatory framework
Project context
medical models a vertebral-segmentation medical device software (SaMD) for spinal surgical planning:
- High-risk under MDR Annex I (medical device software) and EU AI Act Annex III §6
- Task: volumetric vertebral segmentation in CT
- Model: TotalSegmentator v2 (BYO, no proprietary fine-tuning)
- Cohort: TCIA Spine-Mets-CT-SEG — ~55 real CT cases with spinal metastases (effective cohort ≈33 after filtering), multi-scanner, multi-protocol
- Pipeline: DVC, a single evaluate stage (segments in memory, with no intermediate CSV; the CSV was removed because it was PHI-shaped and a source of staleness)
- AssuranceProgram (the risk: section of sei.yaml): 4 risks, 14 measures → 11 ex-ante controls
Accuracy is measured with the per-case Dice score (the Sørensen-Dice similarity coefficient), with bootstrap confidence intervals clustered by patient. The equity metrics are es_dice (Equity-Scaled Dice, FairSeg ICLR’24) and score_gap.
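To make the metric names concrete, here is a minimal pure-Python sketch of the three quantities. It is not the venturalitica registry implementation. The es_dice formula follows the equity-scaled form described in the FairSeg paper (overall score divided by one plus the summed absolute deviations of the group scores), and the score_gap definition (best group mean minus worst group mean) is an assumption made for illustration.

```python
from statistics import mean


def dice(pred: set, truth: set) -> float:
    """Sørensen-Dice overlap of two voxel-index sets: 2|A∩B| / (|A| + |B|)."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def es_dice(scores_by_group: dict) -> float:
    """Equity-scaled Dice: overall mean / (1 + sum of |overall - group mean|)."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    overall = mean(all_scores)
    deviation = sum(abs(overall - mean(scores)) for scores in scores_by_group.values())
    return overall / (1 + deviation)


def score_gap(scores_by_group: dict) -> float:
    """Assumed definition: best group mean minus worst group mean."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return max(means) - min(means)


# Two masks that share half of their voxels.
print(dice({1, 2, 3, 4}, {3, 4, 5, 6}))  # 0.5
```

When every group has the same mean, the deviation term is zero and es_dice equals the plain mean Dice; any disparity between groups pulls it below the mean.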
Step 1 — Prepare the environment
1. Verify that sei is compiled:

       sei --help

   All 16 subcommands should appear. If not, follow the installation guide.

2. Create a working directory and copy the scenario files:

       mkdir /tmp/medical && cd /tmp/medical
       cp -r /path/to/seigarrena/crates/seigarrena-cli/tests/resources/medical/. .

3. Initialise git and DVC, and register the cohort metadata:

       git init
       dvc init
       dvc add data/trusted_metadata.csv
       git add sei.yaml dvc.yaml \
         data/trusted_metadata.csv.dvc data/cohort.croissant.json \
         eval/ .dvc/ .gitignore

4. Create the Python environment with uv:

       uv venv .venv
       uv pip install \
         "venturalitica[imaging]==0.6.12" \
         "dvc>=3" \
         "TotalSegmentator" \
         "monai" "nibabel" "pydicom" "pydicom-seg" \
         "SimpleITK" "pandas" "numpy" "pyyaml"

   venturalitica[imaging] includes the native segmentation metrics from the registry (Dice, NSD, es_dice, score_gap), available from version 0.6.12.
Step 2 — Project files
    medical/
    ├── sei.yaml                   # System manifest (includes the Art. 9 + MDR risk programme)
    ├── shared_data/policies/      # destination of the OSCAL assessment_plan generated by sei compile
    ├── dvc.yaml                   # Pipeline: 1 stage `evaluate` (in-memory, no CSV)
    ├── data/
    │   ├── trusted_metadata.csv   # Real acquisition metadata (TCIA via NBIA API)
    │   └── cohort.croissant.json  # Data governance §2 (Croissant, EU AI Act Art. 10)
    └── eval/
        ├── eval.py                # DVC stage: segments in memory + audits with SDK
        ├── segment.py             # segment_cohort(): TotalSegmentator/GPU
        └── dicom_utils.py         # DICOM/SEG utilities

sei.yaml — the manifest
Below is an illustrative, abbreviated version of the manifest (the risks: block is summarised; the scenario file ships the full one). It is meant to convey the shape of a sei.yaml:

    apiVersion: seigarrena.dev/v1alpha1
    kind: AISystem
    system:
      name: spine-segmentation
      # English: "Vertebral segmentation in CT on a validation cohort (eval-only)."
      intended_purpose: "Segmentacion vertebral en TC sobre cohorte de validacion (eval-only)."
    task:
      modality: image
      type: segmentation
    eval: { script: eval/eval.py }
    pipeline: { tool: dvc, metrics: metrics.json }
    oscal: { assessment_plan: shared_data/policies/assessment_plan.oscal.yaml }
    dataset: { croissant: data/cohort.croissant.json }
    artifacts:
      model: { kind: totalsegmentator, seed: 0 }
    risk:
      appetite: { individual: HIGH, society: HIGH, organization: HIGH }
      criteria: { scale: "5x5" }
      risks:
        # 4 risks with their measures (each carrying MDR frameworks); see table below

The appetite is set to HIGH for all three dimensions: the device operates in a high-risk environment (MDR) where the appetite is higher than in a financial system. The gate controls (global Dice, equity by sex/age, robustness by scanner) must exceed the clinical threshold for the gate to be GREEN; the audit controls (such as the worst subgroup) record their value but do not block it.
dvc.yaml — a single stage
The pipeline has a single evaluate stage that segments the cohort in memory (no intermediate CSV) and audits the results against the OSCAL assessment plan:

    stages:
      evaluate:
        cmd: .venv/bin/python eval/eval.py
        deps:
          - eval/eval.py
          - eval/segment.py
          - eval/dicom_utils.py
          - data/trusted_metadata.csv
          - shared_data/policies/assessment_plan.oscal.yaml
        metrics:
          - metrics.json:
              cache: false

The handoff between segmentation and auditing is in memory: segment_cohort() returns a DataFrame with a per-case Dice column, which the SDK evaluates as registry metrics (mean_score, worst_cell_score, es_dice, etc.) with power-stats clustered by patient.
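The sketch below illustrates how per-case rows could be reduced to a global mean and a worst composite cell. It is an illustration only and does not reproduce the SDK. The rows, column names, and values are invented stand-ins for the DataFrame returned by segment_cohort(), and "worst cell = lowest mean Dice over composite subgroup cells" is an assumed reading of worst_cell_score.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-case rows; column names and values are invented.
ROWS = [
    {"PatientID": "P01", "sex": "F", "scanner": "A", "dice": 0.93},
    {"PatientID": "P02", "sex": "F", "scanner": "B", "dice": 0.91},
    {"PatientID": "P03", "sex": "M", "scanner": "A", "dice": 0.89},
    {"PatientID": "P04", "sex": "M", "scanner": "B", "dice": 0.71},
    {"PatientID": "P05", "sex": "M", "scanner": "B", "dice": 0.73},
]


def cell_means(rows, keys):
    """Mean Dice per composite subgroup cell, e.g. (sex, scanner)."""
    cells = defaultdict(list)
    for row in rows:
        cells[tuple(row[k] for k in keys)].append(row["dice"])
    return {cell: mean(values) for cell, values in cells.items()}


def mean_score(rows):
    """Global mean Dice over all cases."""
    return mean(row["dice"] for row in rows)


def worst_cell(rows, keys):
    """Return (cell, mean Dice) for the lowest-scoring composite cell."""
    means = cell_means(rows, keys)
    cell = min(means, key=means.get)
    return cell, means[cell]


print(mean_score(ROWS))
print(worst_cell(ROWS, ("sex", "scanner")))
```

With these invented values the global mean sits well above the worst cell, which is the same pattern the scenario reports: a globally fit model with a weak composite subgroup.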
The risk: section — 4 risks, dual regulation
The AssuranceProgram (the risk: section of sei.yaml) declares 4 risks with 14 measures. Each measure carries frameworks: [eu/mdr@2017#gspr-<n>], naming the GSPR from MDR Annex I that corresponds to that control:
| Risk | Key measures | MDR GSPR |
|---|---|---|
| risk.subgroup-segmentation-bias | worst-subgroup-dice (audit, critical — does not block), human-oversight-worst-case-review (HOTL, Art.14) | GSPR-9 (risk control) |
| risk.segmentation-accuracy-failure | vertebra-segmentation-dice (gate), data-leakage-check (gate) | GSPR-17.1 (software performance) |
| risk.data-governance-failure | fairness-sex-disparate-impact (es_dice, gate), privacy-k-anonymity (gate) | GSPR-17.1 (data governance) |
| risk.scanner-protocol-bias | robustness-scanner-bias, fairness-scanner-model (both gate) | GSPR-17.1 (multi-scanner performance) |
worst-subgroup-dice has enforcement: audit (severity: critical but non-blocking): it records the worst subgroup and surfaces the residual honestly, but does not bring down the gate. The gate coverage of prEN 18228 cl. 9.2 comes from vertebra-segmentation-dice and fairness-sex-disparate-impact (both enforcement: gate).
The two post-market controls (postmarket-subgroup-drift, postmarket-dice-drift) and the HOTL oversight (human-oversight-worst-case-review) have lifecycle: [monitoring]: sei compile excludes them from the 11 ex-ante controls of the assessment plan (14 measures − 3 monitoring = 11).
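The exclusion rule can be sketched as a simple filter. This is not the sei compile implementation. The list below contains only the ten measures named in this tutorial (the scenario declares 14), and treating a missing lifecycle field as "ex-ante" is an assumption.

```python
# Only the measures named in this tutorial; the real scenario declares 14.
# Assumption: a measure without a lifecycle field is an ex-ante control.
MEASURES = [
    {"id": "worst-subgroup-dice"},
    {"id": "vertebra-segmentation-dice"},
    {"id": "data-leakage-check"},
    {"id": "fairness-sex-disparate-impact"},
    {"id": "privacy-k-anonymity"},
    {"id": "robustness-scanner-bias"},
    {"id": "fairness-scanner-model"},
    {"id": "human-oversight-worst-case-review", "lifecycle": ["monitoring"]},
    {"id": "postmarket-subgroup-drift", "lifecycle": ["monitoring"]},
    {"id": "postmarket-dice-drift", "lifecycle": ["monitoring"]},
]


def ex_ante_controls(measures):
    """Keep every measure that is not tagged for the monitoring lifecycle."""
    return [m for m in measures if "monitoring" not in m.get("lifecycle", [])]


selected = ex_ante_controls(MEASURES)
print(len(MEASURES), "named measures ->", len(selected), "ex-ante")
```

On this partial list the filter keeps 7 of 10; on the full scenario the same rule gives 14 − 3 = 11.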
See The three treatment modalities for the full regulatory framing of each risk.
Step 3 — sei compile: AssuranceProgram → OSCAL (11 ex-ante controls)
    sei compile

sei compile reads the risk: section of sei.yaml, builds the AssuranceProgram in memory, and generates shared_data/policies/assessment_plan.oscal.yaml with the 11 ex-ante controls (excluding the three lifecycle: [monitoring] ones). This file is the dependency of the evaluate DVC stage; its digest enters dvc.lock and the signed bundle.

    git add shared_data/policies/assessment_plan.oscal.yaml
    git commit -m "init medical REAL (TCIA cohort + TotalSegmentator GPU; online, no cohort_results.csv)"

See the sei CLI reference for details on the compile subcommand.
Step 4 — sei run: ex-ante evaluation with a GREEN gate and an unmet AUDIT control
    sei run

sei run triggers dvc repro, which executes the evaluate stage:

- segment_cohort() passes all TCIA images through TotalSegmentator/GPU and returns a DataFrame with the real Dice per case
- The SDK (vl.enforce(data=df, cluster=PatientID, phase=validation)) evaluates the 11 assessment-plan controls with power-stats clustered by patient
- The engine applies the gate and signs the bundle
Documented real result: sei run exits with code 0 (GREEN gate). The model is globally fit (global Dice ≈0.90). The results include:
| Control | Operator | Threshold | Result | Enforcement | Status |
|---|---|---|---|---|---|
| vertebra-segmentation-dice | > | 0.85 | ~0.90 (global) | gate | PASSES |
| data-leakage-check | < | 0.99 | (no Dice≈1.0) | gate | PASSES |
| fairness-sex-disparate-impact | > | 0.80 | fit (es_dice) | gate | PASSES |
| worst-subgroup-dice | > | 0.75 | below threshold (worst cell) | audit (non-blocking) | UNMET (recorded) |
The gate stays GREEN because all enforcement: gate controls pass (global Dice, equity by sex/age, robustness by scanner, k-anonymity). The worst-subgroup-dice control (MDR GSPR-9, severity: critical) fails on its point value, because the worst composite subgroup falls on atypical cases, but it has enforcement: audit: it records the residual and surfaces it honestly in Annex IV §4 without bringing the gate down. In a small cohort with atypical cases, a weak worst subgroup is a limitation of the BYO model, not a defect to be corrected with a gate; it is treated with human oversight (HOTL), not by forcing the control to green.
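The gate/audit distinction can be sketched as follows. This is an illustration under stated assumptions and not the engine's code. The Control class and evaluate_gate function are hypothetical names, and the worst-subgroup value of 0.70 is a placeholder, since the tutorial only reports it as "below threshold".

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, "<": operator.lt}


@dataclass
class Control:
    id: str
    op: str
    threshold: float
    value: float
    enforcement: str  # "gate" (blocking) or "audit" (recorded only)

    def passes(self) -> bool:
        return OPS[self.op](self.value, self.threshold)


def evaluate_gate(controls):
    """GREEN unless a gate control fails; unmet audit controls are only listed."""
    blocking = [c.id for c in controls if c.enforcement == "gate" and not c.passes()]
    unmet_audit = [c.id for c in controls if c.enforcement == "audit" and not c.passes()]
    return ("RED" if blocking else "GREEN"), unmet_audit


controls = [
    Control("vertebra-segmentation-dice", ">", 0.85, 0.90, "gate"),
    # 0.70 is a placeholder value, not a reported result.
    Control("worst-subgroup-dice", ">", 0.75, 0.70, "audit"),
]
print(evaluate_gate(controls))  # ('GREEN', ['worst-subgroup-dice'])
```

Switching worst-subgroup-dice to enforcement "gate" in this sketch would turn the result RED, which is exactly the modelling decision the scenario avoids for a small, noisy subgroup.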
Verify the signature and check the status:
    sei verify   # OK: ECDSA-P256+DSSE+in-toto signature valid
    sei status   # exit 0: GREEN gate (prints worst-subgroup-dice as an unmet AUDIT control)

sei status returns exit 0 (green gate) and prints worst-subgroup-dice because it is an unmet AUDIT control, so the residual stays visible, but it does not put the gate in the red.
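If these two checks need to run from a script, a thin wrapper over the exit codes is enough. The sketch below assumes that both commands return a non-zero exit code on failure; the tutorial only documents exit 0 for the GREEN case. The check_bundle name and the injectable run parameter are hypothetical, added so the function can be exercised without sei installed.

```python
import subprocess


def check_bundle(run=subprocess.run) -> bool:
    """Run `sei verify` then `sei status`; True only if both exit with 0.

    Assumption: a failed verification or a non-green gate yields a
    non-zero exit code.
    """
    for cmd in (["sei", "verify"], ["sei", "status"]):
        result = run(cmd)
        if result.returncode != 0:
            return False
    return True


if __name__ == "__main__":
    # Requires the sei binary on PATH.
    raise SystemExit(0 if check_bundle() else 1)
```

An unmet AUDIT control does not change the outcome here, because sei status still exits 0 in that case; the residual appears only in the printed output.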
Step 5 — Treating the subgroup residual: human HOTL oversight (Art.14)
The worst-subgroup-dice control fails on its point value, but it is audit (does not block): the gate is already GREEN. The subgroup residual is not treated with a parameter or code change, because the model is BYO (TotalSegmentator) and is not retrained, but with a declared ex-ante organisational control: human-on-the-loop oversight (HOTL, Art.14).
In the risk: section of risk risk.subgroup-segmentation-bias, the treatment is method: REDUCE with controls: [eu/ai-act@2024#art-15, eu/ai-act@2024#art-14], and the confirming control is:
    - id: human-oversight-worst-case-review
      metric: hotl_worst_case_review_declared
      constraint: "> 0"
      severity: high
      enforcement: audit
      evaluation: manual
      lifecycle: [monitoring]
      article: "14"
      frameworks: [eu/mdr@2017#gspr-17.1]
      standard_clauses: ["eu/pren-18228@2026#14", "iso/42105@2026#intervention"]

A clinician validates the segmentation of the worst-subgroup cases before downstream use (surgical planning). The control is declared ex-ante in the manifest and operates post-market (Phase 2, lifecycle: [monitoring]): that is why sei compile excludes it from the 11 ex-ante controls, but it is recorded as a means of conformance for Art.14.
Declaring that oversight is the ISO 23894 §6.5 treatment: a versioned change in sei.yaml, with authorship and timestamp, documenting the proportionate response to the subgroup residual.
    git commit -m "treat(medical): declare HOTL oversight (Art.14) for worst-subgroup cases"

See The three treatment modalities for the regulatory framing of the human-oversight modality.
Step 6 — Scenario status: GREEN gate, residual treated
    sei verify   # OK: signed bundle, GREEN gate
    sei status   # exit 0: GREEN gate; worst-subgroup-dice listed as an unmet AUDIT control

The medical.json scenario ends here: the gate is GREEN, the model is globally fit (global Dice ≈0.90), and the subgroup residual is recorded (by the audit control worst-subgroup-dice) and treated (by the declared HOTL oversight). There is no FAILS→PASSES arc from a parameter adjustment: the gate was never red. The ISO 23894 §6.5 cycle for risk risk.subgroup-segmentation-bias is closed by the HOTL treatment, not by re-measurement.
Step 7 — sei conformance: MDR + EU AI Act projection
    sei conformance --standard eu/pren-18228@2026 --out

The same bundle projects onto prEN 18228 (the harmonised risk management standard, Art. 9):
| Clause | Result | Why |
|---|---|---|
| cl. 9.2 (control measures) | Covered | Provided by the gate controls vertebra-segmentation-dice and fairness-sex-disparate-impact (enforcement: gate, aligned with cl. 9.2); worst-subgroup-dice is audit (confirming, does not provide the gate coverage) |
| cl. 10 (global residual) | Covered / Gap | The engine aggregates the global residual (evaluate_overall_residual()) and sei conformance cl.10 emits COVERED or GAP; it is only advisory in the sei run gate |
    sei conformance --standard iso/23894@2023

ISO 23894 projection: clause §6.5 is covered when the cycle of risk.subgroup-segmentation-bias is closed (here, by the declared HOTL treatment).
The frameworks: [eu/mdr@2017#gspr-9] field in the risk: section causes the bundle to include MDR traceability in the conformance reports without any additional annotation.
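The traceability idea can be illustrated by parsing the reference strings and grouping measures by clause. The "standard@version#clause" layout is inferred from the examples in this tutorial and is not a documented grammar. The parse_ref and traceability helpers are hypothetical, and the mapping lists only three measures whose GSPR appears on this page.

```python
from collections import defaultdict


def parse_ref(ref: str):
    """Split 'eu/mdr@2017#gspr-9' into (standard, version, clause)."""
    standard, _, rest = ref.partition("@")
    version, _, clause = rest.partition("#")
    return standard, version, clause


# Measure -> frameworks, limited to pairs stated in this tutorial.
MEASURE_FRAMEWORKS = {
    "worst-subgroup-dice": ["eu/mdr@2017#gspr-9"],
    "vertebra-segmentation-dice": ["eu/mdr@2017#gspr-17.1"],
    "human-oversight-worst-case-review": ["eu/mdr@2017#gspr-17.1"],
}


def traceability(measure_frameworks):
    """Group measure ids under each (standard, version, clause) they cite."""
    by_clause = defaultdict(list)
    for measure_id, refs in measure_frameworks.items():
        for ref in refs:
            by_clause[parse_ref(ref)].append(measure_id)
    return dict(by_clause)


for key, measures in traceability(MEASURE_FRAMEWORKS).items():
    print(key, measures)
```

A reference without a clause, such as iso/23894@2023, parses to an empty clause string in this sketch.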
The Annex IV (Art. 11) is not emitted by the engine: it is assembled and rendered by the cloud (control plane) from the signed bundle.json, with provenance per field, and delivered as a PDF. There is no sei annex-iv subcommand. The cloud can mark §2/§3/§4/§9 PENDING if their input is missing (data / misuse+residuals+Art.14 / control results / post-market measures); §7 and §8 are never pending. In this scenario, with data and controls present, only §3 and §9 would stay pending. See Guide: Render the Annex IV.
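The PENDING rule described above can be written as a small lookup. This is a reading of the paragraph and not the cloud's implementation; the pending_sections function and the input labels are hypothetical names.

```python
# Input each conditional section depends on, as listed in the text.
# §7 and §8 are absent because they are never pending.
REQUIRED_INPUT = {
    "§2": "data",
    "§3": "misuse+residuals+Art.14",
    "§4": "control results",
    "§9": "post-market measures",
}


def pending_sections(available_inputs):
    """Return the sections whose required input is missing."""
    return [
        section
        for section, needed in REQUIRED_INPUT.items()
        if needed not in available_inputs
    ]


# This scenario: data and control results are present.
print(pending_sections({"data", "control results"}))  # ['§3', '§9']
```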
Contrast with tutoriales/primer-sistema (loan)
| Dimension | loan (credit) | medical (medical imaging) |
|---|---|---|
| Regulatory framework | EU AI Act + DORA (financial entity) | EU AI Act + MDR (SaMD) |
| Treatment modality | Code change (train.py V1→V2, fairlearn) | Human HOTL oversight (Art.14, declared organisational control) |
| Model | LogReg, trained in-pipeline | TotalSegmentator v2, BYO (no fine-tuning) |
| Primary metric | demographic_parity_diff (tabular) | Volumetric Dice (image) |
| Equity | disparate_impact (binary, by gender) | es_dice (ESSP, equity over continuous Dice) |
| Power-stats | Bootstrap per sample (n=1000) | Bootstrap per patient cluster (n≈33 effective) |
| Gate at T0 | RED — global control fails (fairness) | GREEN — all gate controls pass; the worst subgroup (audit) records the residual |
| Response to residual | Retrain (Drift B: model changes) | Declare HOTL oversight (organisational control, model and data unchanged) |
The engine is the same in both scenarios. The Reproducer seam dispatches dvc repro in both cases. What changes is the treatment modality and the regulatory framework — declared in the AssuranceProgram via frameworks and projected in the conformance reports without modifying the engine.
What this tutorial demonstrates
- An AssuranceProgram with dual regulation (EU AI Act + MDR) declared per measure via frameworks
- 11 ex-ante controls generated by sei compile from 4 risks and 14 measures (excluding 3 with lifecycle: [monitoring])
- A GREEN gate on a real GPU cohort: all gate controls pass (global Dice ≈0.90); the worst subgroup (audit) records the residual without blocking
- The treatment via human HOTL oversight (Art.14): a declared ex-ante organisational control, not a parameter or code change
- Honest UNDERPOWERED reporting: with an effective cohort of ≈33, the bootstrap CIs are wide; the worst subgroup is modelled as audit precisely because at small n it can be noise
- Dual conformance projected over the same bundle: prEN 18228 + ISO 23894
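The UNDERPOWERED point is easier to see with a patient-clustered bootstrap in hand. The sketch below is a generic percentile bootstrap that resamples patients rather than individual cases. It is not the SDK's power-stats routine, and the function name, the defaults, and the percentile method are assumptions.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_patient, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean Dice, resampling whole patients.

    scores_by_patient maps a patient id to the list of per-case Dice
    values for that patient, so correlated cases stay together.
    """
    rng = random.Random(seed)
    patients = list(scores_by_patient)
    stats = []
    for _ in range(n_boot):
        # Draw patients with replacement, then pool all of their cases.
        sample = rng.choices(patients, k=len(patients))
        stats.append(mean(s for p in sample for s in scores_by_patient[p]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical three-patient subgroup: one atypical case dominates the spread.
subgroup = {"P1": [0.95], "P2": [0.90], "P3": [0.60]}
print(cluster_bootstrap_ci(subgroup))
```

With only a handful of patients in a cell, a single atypical case moves the interval substantially, which is the reason given above for treating the worst subgroup as an audit control instead of a gate.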
Pending gaps
| Gap | Why it matters |
|---|---|
| HOTL declared ex-ante, post-market operation without live data | The human-oversight-worst-case-review control documents the plan; the clinical-review evidence is runtime (Phase 2) |
| NSD / HD95 as cross-check | The Dice size-bias caveat (arXiv 2509.19778) requires cross-checking with boundary metrics; future line |
| Post-market controls (postmarket-subgroup-drift) | Ex-ante only in v1; Art. 72 not modelled in the engine |
| Annex IV §3/§9 PENDING in this scenario | §2/§3/§4/§9 can be pending if their input is missing; here only §3 and §9 (depend on the deployer’s environment); §7 and §8 are never pending |
See State and incompleteness for the complete gap map.