
Tutorial: medical system (human HOTL oversight)

🟡 Partial — The medical.json scenario runs the full ex-ante evaluation (11 controls, GREEN gate). The worst-subgroup control (worst-subgroup-dice) is AUDIT: it fails on its point value but does NOT block the gate; the risk is treated with declared human HOTL oversight (Art.14). The evaluation does not run in hermetic CI (requires GPU + TCIA cache).

This tutorial walks through Venturalítica’s medical scenario: high-risk Software as a Medical Device (SaMD) under the MDR (Regulation (EU) 2017/745) and the EU AI Act (Art. 6(1) with Annex I, which lists the MDR). The vertebral segmentation engine is TotalSegmentator, operating on the public TCIA Spine-Mets-CT-SEG cohort (≈55 real CT cases). The configuration fragments below are illustrative examples: they show how to write a sei.yaml for a medical-imaging system and are not a literal dump of any internal file.

What you will learn:

  • How the engine manages a medical imaging system with segmentation metrics (Dice, es_dice) and equity metrics (score_gap, worst_cell_score)
  • Why the gate can stay GREEN even with an unmet AUDIT control (the worst subgroup): an audit control records the residual but does not block
  • How that residual is treated with human HOTL oversight (Art.14, the human-oversight-worst-case-review control), not with a parameter or code change
  • How the MDR framework is applied via frameworks in the risk programme (the risk: section of sei.yaml)
  • The contrast with tutoriales/primer-sistema (loan = code change): same engine, different regulatory framework

The medical scenario models vertebral-segmentation Software as a Medical Device (SaMD) for spinal surgical planning:

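  • High-risk under the EU AI Act (Art. 6(1) with Annex I, which lists the MDR), and subject as medical device software to the general safety and performance requirements (GSPR) of MDR Annex I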
  • Task: volumetric vertebral segmentation in CT
  • Model: TotalSegmentator v2 (BYO, no proprietary fine-tuning)
  • Cohort: TCIA Spine-Mets-CT-SEG — ~55 real CT cases with spinal metastases (effective cohort ≈33 after filtering), multi-scanner, multi-protocol
  • Pipeline: DVC, a single evaluate stage (segments in memory, no intermediate CSV — removed because it was PHI-shaped and a source of staleness)
  • AssuranceProgram (the risk: section of sei.yaml): 4 risks, 14 measures → 11 ex-ante controls

The accuracy measurement is the per-case Dice score (Sørensen–Dice similarity coefficient), with bootstrap confidence intervals clustered by patient. The equity metrics are es_dice (Equity-Scaled Dice, FairSeg ICLR’24) and score_gap.
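To make the two ideas in the previous paragraph concrete, here is a minimal sketch of a per-case Dice score and a percentile bootstrap that resamples whole patients rather than individual cases. It is not the SDK's implementation: the function names, the flat 0/1 mask representation, and the synthetic rows are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def dice(pred, truth):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def cluster_bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean Dice, resampling patients (clusters) with replacement.

    rows: iterable of (patient_id, dice_score); a patient may contribute several cases.
    """
    by_patient = defaultdict(list)
    for patient_id, score in rows:
        by_patient[patient_id].append(score)
    patients = sorted(by_patient)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        drawn = rng.choices(patients, k=len(patients))
        sample = [s for p in drawn for s in by_patient[p]]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic scores, NOT results from the TCIA cohort.
example_rows = [("p1", 0.92), ("p1", 0.90), ("p2", 0.88), ("p3", 0.71), ("p4", 0.93)]
ci_low, ci_high = cluster_bootstrap_ci(example_rows)
```

Resampling at the patient level keeps correlated cases from the same patient together, which is why the interval is typically wider than a per-case bootstrap on the same data.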


  1. Verify that sei is compiled:

    Terminal
    sei --help

    All 16 subcommands should appear. If not, follow the installation guide.

  2. Create a working directory and copy the scenario files:

    Terminal
    mkdir /tmp/medical && cd /tmp/medical
    cp -r /path/to/seigarrena/crates/seigarrena-cli/tests/resources/medical/. .
  3. Initialise git and DVC, and register the cohort metadata:

    Terminal
    git init
    dvc init
    dvc add data/trusted_metadata.csv
    git add sei.yaml dvc.yaml \
    data/trusted_metadata.csv.dvc data/cohort.croissant.json \
    eval/ .dvc/ .gitignore
  4. Create the Python environment with uv:

    Terminal
    uv venv .venv
    uv pip install \
    "venturalitica[imaging]==0.6.12" \
    "dvc>=3" \
    "TotalSegmentator" \
    "monai" "nibabel" "pydicom" "pydicom-seg" \
    "SimpleITK" "pandas" "numpy" "pyyaml"

    venturalitica[imaging] includes the native segmentation metrics from the registry (Dice, NSD, es_dice, score_gap), available from version 0.6.12.


medical/
├── sei.yaml                     # System manifest (includes the Art. 9 + MDR risk programme)
├── shared_data/policies/        # destination of the OSCAL assessment_plan generated by sei compile
├── dvc.yaml                     # Pipeline: 1 stage `evaluate` (in-memory, no CSV)
├── data/
│   ├── trusted_metadata.csv     # Real acquisition metadata (TCIA via NBIA API)
│   └── cohort.croissant.json    # Data governance §2 (Croissant, EU AI Act Art. 10)
└── eval/
    ├── eval.py                  # DVC stage: segments in memory + audits with SDK
    ├── segment.py               # segment_cohort(): TotalSegmentator/GPU
    └── dicom_utils.py           # DICOM/SEG utilities

Below is an illustrative, abbreviated version of the manifest (the risks: block is summarised; the scenario file ships the full one). It is meant to convey the shape of a sei.yaml:

sei.yaml (abbreviated, illustrative)
apiVersion: seigarrena.dev/v1alpha1
kind: AISystem
system:
  name: spine-segmentation
  intended_purpose: "Vertebral segmentation in CT on a validation cohort (eval-only)."
task:
  modality: image
  type: segmentation
eval: { script: eval/eval.py }
pipeline: { tool: dvc, metrics: metrics.json }
oscal: { assessment_plan: shared_data/policies/assessment_plan.oscal.yaml }
dataset: { croissant: data/cohort.croissant.json }
artifacts:
  model: { kind: totalsegmentator, seed: 0 }
risk:
  appetite: { individual: HIGH, society: HIGH, organization: HIGH }
  criteria: { scale: "5x5" }
  risks:
    # 4 risks with their measures (each carrying MDR frameworks); see table below

The appetite is set to HIGH for all three dimensions: the device operates in a high-risk environment (MDR) where the appetite is higher than in a financial system. The gate controls (global Dice, equity by sex/age, robustness by scanner) must exceed the clinical threshold for the gate to be GREEN; the audit controls (such as the worst subgroup) record their value but do not block it.

The pipeline has a single evaluate stage that segments the cohort in memory (no intermediate CSV) and audits the results against the OSCAL assessment plan:

dvc.yaml (excerpt)
stages:
  evaluate:
    cmd: .venv/bin/python eval/eval.py
    deps:
      - eval/eval.py
      - eval/segment.py
      - eval/dicom_utils.py
      - data/trusted_metadata.csv
      - shared_data/policies/assessment_plan.oscal.yaml
    metrics:
      - metrics.json:
          cache: false

The handoff between segmentation and auditing is in memory: segment_cohort() returns a DataFrame with a Dice column per patient, which the SDK evaluates as registry metrics (mean_score, worst_cell_score, es_dice, etc.) with power-stats clustered by patient.
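The registry metrics named above can be sketched over a list of (subgroup, Dice) pairs. This is an approximation under stated assumptions, not the registry's code: worst_cell_score is taken as the lowest subgroup mean, score_gap as the spread between the best and worst subgroup means, and es_dice follows the equity-scaled form from the FairSeg paper (overall score divided by one plus the summed absolute deviations of the subgroup means). The function name and the example rows are hypothetical.

```python
from collections import defaultdict


def subgroup_metrics(rows):
    """rows: iterable of (group_label, dice_score). Returns summary metrics as a dict."""
    rows = list(rows)
    overall = sum(score for _, score in rows) / len(rows)

    groups = defaultdict(list)
    for label, score in rows:
        groups[label].append(score)
    means = {label: sum(v) / len(v) for label, v in groups.items()}

    # Equity-scaled score: penalise the overall mean by how far subgroups deviate from it.
    disparity = sum(abs(overall - m) for m in means.values())
    return {
        "mean_score": overall,
        "worst_cell_score": min(means.values()),
        "score_gap": max(means.values()) - min(means.values()),
        "es_dice": overall / (1.0 + disparity),
    }


# Synthetic example, NOT cohort results.
metrics = subgroup_metrics([("F", 0.9), ("F", 0.9), ("M", 0.8), ("M", 0.8)])
```

With perfectly equal subgroup means the disparity term is zero and es_dice equals the plain mean, so the gap between mean_score and es_dice is itself a quick read on inequity.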

The risk: section — 4 risks, dual regulation


The AssuranceProgram (the risk: section of sei.yaml) declares 4 risks with 14 measures. Each measure carries frameworks: [eu/mdr@2017#gspr-<n>] — the GSPR from MDR Annex I that corresponds to that control:

| Risk | Key measures | MDR GSPR |
| --- | --- | --- |
| risk.subgroup-segmentation-bias | worst-subgroup-dice (audit, critical, does not block), human-oversight-worst-case-review (HOTL, Art.14) | GSPR-9 (risk control) |
| risk.segmentation-accuracy-failure | vertebra-segmentation-dice (gate), data-leakage-check (gate) | GSPR-17.1 (software performance) |
| risk.data-governance-failure | fairness-sex-disparate-impact (es_dice, gate), privacy-k-anonymity (gate) | GSPR-17.1 (data governance) |
| risk.scanner-protocol-bias | robustness-scanner-bias, fairness-scanner-model (both gate) | GSPR-17.1 (multi-scanner performance) |

worst-subgroup-dice has enforcement: audit (severity: critical but non-blocking): it records the worst subgroup and surfaces the residual honestly, but does not bring down the gate. The gate coverage of prEN 18228 cl. 9.2 comes from vertebra-segmentation-dice and fairness-sex-disparate-impact (both enforcement: gate).

The two post-market controls (postmarket-subgroup-drift, postmarket-dice-drift) and the HOTL oversight (human-oversight-worst-case-review) have lifecycle: [monitoring]: sei compile excludes them from the 11 ex-ante controls of the assessment plan (14 measures − 3 monitoring = 11).
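The exclusion rule can be sketched as a simple filter on the lifecycle field. The list below contains only the ten measures this tutorial names (the scenario file declares 14), and the empty lifecycle list used for the ex-ante measures is a placeholder assumption; the function name is hypothetical and this is not how sei compile is implemented.

```python
# Only the measures named in this tutorial; the real manifest declares 14.
NAMED_MEASURES = [
    {"id": "worst-subgroup-dice", "lifecycle": []},
    {"id": "human-oversight-worst-case-review", "lifecycle": ["monitoring"]},
    {"id": "vertebra-segmentation-dice", "lifecycle": []},
    {"id": "data-leakage-check", "lifecycle": []},
    {"id": "fairness-sex-disparate-impact", "lifecycle": []},
    {"id": "privacy-k-anonymity", "lifecycle": []},
    {"id": "robustness-scanner-bias", "lifecycle": []},
    {"id": "fairness-scanner-model", "lifecycle": []},
    {"id": "postmarket-subgroup-drift", "lifecycle": ["monitoring"]},
    {"id": "postmarket-dice-drift", "lifecycle": ["monitoring"]},
]


def ex_ante_controls(measures):
    """Keep only measures that are not tagged for the monitoring lifecycle."""
    return [m for m in measures if "monitoring" not in m.get("lifecycle", [])]


monitoring_ids = [m["id"] for m in NAMED_MEASURES if "monitoring" in m["lifecycle"]]
ex_ante_ids = [m["id"] for m in ex_ante_controls(NAMED_MEASURES)]
```

Applied to the full manifest, the same filter gives the arithmetic stated above: 14 measures minus 3 monitoring measures leaves 11 ex-ante controls.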

See The three treatment modalities for the full regulatory framing of each risk.


Step 3 — sei compile: AssuranceProgram → OSCAL (11 ex-ante controls)

Terminal
sei compile

sei compile reads the risk: section of sei.yaml, builds the AssuranceProgram in memory, and generates shared_data/policies/assessment_plan.oscal.yaml with the 11 ex-ante controls (excluding the three lifecycle: [monitoring] measures). This file is a dependency of the evaluate DVC stage; its digest enters dvc.lock and the signed bundle.
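The point of recording a digest is that any edit to the assessment plan changes the hash, so the evaluate stage and the bundle can no longer be mistaken for ones produced against the old plan. The sketch below illustrates that idea only: it uses SHA-256 as an assumption (the tutorial does not state which algorithm DVC or the bundle use here), the helper name is hypothetical, and the file content is a placeholder, not real OSCAL.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algo="sha256"):
    """Stream a file through a hash function and return the hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    plan = Path(tmp) / "assessment_plan.oscal.yaml"
    plan.write_text("placeholder: plan-v1\n")  # placeholder content, not real OSCAL
    before = file_digest(plan)
    plan.write_text("placeholder: plan-v2\n")  # simulate a recompiled plan
    after = file_digest(plan)

plan_changed = before != after
```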

Terminal
git add shared_data/policies/assessment_plan.oscal.yaml
git commit -m "init medical REAL (TCIA cohort + TotalSegmentator GPU; online, no cohort_results.csv)"

See the sei CLI reference for details on the compile subcommand.


Step 4 — sei run: ex-ante evaluation with a GREEN gate and an unmet AUDIT control

Terminal
sei run

sei run triggers dvc repro, which executes the evaluate stage:

  1. segment_cohort() passes all TCIA images through TotalSegmentator/GPU and returns a DataFrame with the real Dice per case
  2. The SDK (vl.enforce(data=df, cluster=PatientID, phase=validation)) evaluates the 11 assessment-plan controls with power-stats clustered by patient
  3. The engine applies the gate and signs the bundle

Real result documented: sei run returns exit 0 — GREEN gate. The model is globally fit (global Dice ≈0.90). The results are:

| Control | Operator | Threshold | Result | Enforcement | Status |
| --- | --- | --- | --- | --- | --- |
| vertebra-segmentation-dice | > | 0.85 | ~0.90 (global) | gate | PASSES |
| data-leakage-check | < | 0.99 | (no Dice≈1.0) | gate | PASSES |
| fairness-sex-disparate-impact | > | 0.80 | fit (es_dice) | gate | PASSES |
| worst-subgroup-dice | > | 0.75 | below threshold (worst cell) | audit (non-blocking) | UNMET (recorded) |

The gate stays GREEN because all enforcement: gate controls pass (global Dice, equity by sex/age, robustness by scanner, k-anonymity). The worst-subgroup-dice control (MDR GSPR-9, severity: critical) fails on its point value, because the worst composite subgroup falls on atypical cases, but it has enforcement: audit: it records the residual and surfaces it in Annex IV §4 without bringing the gate down. In a small cohort with atypical cases, the worst-subgroup result reflects a limitation of the BYO model, not a defect that a gate could correct; it is treated with human oversight (HOTL) rather than by forcing the control to pass.
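The gate/audit distinction can be sketched as a small decision function: only failed gate controls turn the gate RED, while failed audit controls are collected and reported. This is an illustrative model, not the engine's code. The class and function names are hypothetical, the thresholds come from the table above, and every measured value except the documented global Dice of about 0.90 is a placeholder chosen only to reproduce the documented pass/fail status.

```python
import operator
from dataclasses import dataclass

OPS = {">": operator.gt, "<": operator.lt}


@dataclass
class Control:
    id: str
    op: str
    threshold: float
    value: float
    enforcement: str  # "gate" (blocking) or "audit" (recorded, non-blocking)

    @property
    def passes(self):
        return OPS[self.op](self.value, self.threshold)


def evaluate_gate(controls):
    """Return (gate_colour, ids of unmet audit controls)."""
    blocking = [c.id for c in controls if c.enforcement == "gate" and not c.passes]
    unmet_audit = [c.id for c in controls if c.enforcement == "audit" and not c.passes]
    return ("RED" if blocking else "GREEN"), unmet_audit


CONTROLS = [
    Control("vertebra-segmentation-dice", ">", 0.85, 0.90, "gate"),
    Control("data-leakage-check", "<", 0.99, 0.97, "gate"),            # placeholder value
    Control("fairness-sex-disparate-impact", ">", 0.80, 0.85, "gate"),  # placeholder value
    Control("worst-subgroup-dice", ">", 0.75, 0.70, "audit"),          # placeholder value
]

gate, unmet = evaluate_gate(CONTROLS)
```

Switching worst-subgroup-dice to enforcement "gate" in this model would turn the result RED, which is exactly the design decision the manifest makes explicit.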

Verify the signature and check the status:

Terminal
sei verify # OK — ECDSA-P256+DSSE+in-toto signature valid
sei status # exit 0 — GREEN gate (prints worst-subgroup-dice as an unmet AUDIT control)

sei status returns exit 0 (GREEN gate) and still prints worst-subgroup-dice, because it is an unmet AUDIT control: the residual stays visible, but it does not turn the gate red.


Step 5 — Treating the subgroup residual: human HOTL oversight (Art.14)


The worst-subgroup-dice control fails on its point value, but it is an audit control (it does not block), so the gate is already GREEN. The subgroup residual is not treated with a parameter or code change, since the model is BYO (TotalSegmentator) and is not retrained; it is treated with an organisational control declared ex-ante: human-on-the-loop oversight (HOTL, Art.14).

In the risk: section, under risk.subgroup-segmentation-bias, the treatment is method: REDUCE with controls: [eu/ai-act@2024#art-15, eu/ai-act@2024#art-14], and the confirming control is:

sei.yaml (excerpt of the subgroup risk)
- id: human-oversight-worst-case-review
  metric: hotl_worst_case_review_declared
  constraint: "> 0"
  severity: high
  enforcement: audit
  evaluation: manual
  lifecycle: [monitoring]
  article: "14"
  frameworks: [eu/mdr@2017#gspr-17.1]
  standard_clauses: ["eu/pren-18228@2026#14", "iso/42105@2026#intervention"]

A clinician validates the segmentation of the worst-subgroup cases before downstream use (surgical planning). The control is declared ex-ante in the manifest and operates post-market (Phase 2, lifecycle: [monitoring]): that is why sei compile excludes it from the 11 ex-ante controls, but it is recorded as a means of conformance for Art.14.

Declaring that oversight is the ISO 23894 §6.5 treatment: a versioned change in sei.yaml, with authorship and timestamp, documenting the proportionate response to the subgroup residual.

Terminal
git commit -m "treat(medical): declare HOTL oversight (Art.14) for worst-subgroup cases"

See The three treatment modalities for the regulatory framing of the human-oversight modality.


Step 6 — Scenario status: GREEN gate, residual treated

Terminal
sei verify # OK — signed bundle, GREEN gate
sei status # exit 0 — GREEN gate; worst-subgroup-dice listed as an unmet AUDIT control

The medical.json scenario ends here: the gate is GREEN, the model is globally fit (global Dice ≈0.90), and the subgroup residual is recorded (by the audit control worst-subgroup-dice) and treated (by the declared HOTL oversight). There is no FAILS→PASSES arc from a parameter adjustment: the gate was never red. The ISO 23894 §6.5 cycle for risk risk.subgroup-segmentation-bias is closed by the HOTL treatment, not by re-measurement.


Step 7 — sei conformance: MDR + EU AI Act projection

Terminal
sei conformance --standard eu/pren-18228@2026 --out

The same bundle projects onto prEN 18228 (the harmonised risk management standard, Art. 9):

| Clause | Result | Why |
| --- | --- | --- |
| cl. 9.2 (control measures) | Covered | Provided by the gate controls vertebra-segmentation-dice and fairness-sex-disparate-impact (enforcement: gate, aligned with cl. 9.2); worst-subgroup-dice is audit (confirming, does not provide the gate coverage) |
| cl. 10 (global residual) | Covered / Gap | The engine aggregates the global residual (evaluate_overall_residual()) and sei conformance cl.10 emits COVERED or GAP; it is only advisory in the sei run gate |

Terminal
sei conformance --standard iso/23894@2023

ISO 23894 projection: clause §6.5 is covered when the cycle of risk.subgroup-segmentation-bias is closed (here, by the declared HOTL treatment).

The frameworks: [eu/mdr@2017#gspr-9] field in the risk: section causes the bundle to include MDR traceability in the conformance reports without any additional annotation.

The Annex IV (Art. 11) is not emitted by the engine: it is assembled and rendered by the cloud (control plane) from the signed bundle.json, with provenance per field, and delivered as a PDF. There is no sei annex-iv subcommand. The cloud can mark §2/§3/§4/§9 PENDING if their input is missing (data / misuse+residuals+Art.14 / control results / post-market measures); §7 and §8 are never pending. In this scenario, with data and controls present, only §3 and §9 would stay pending. See Guide: Render the Annex IV.
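The pending rule described above can be sketched as a lookup from each section to the input it needs. This is a hypothetical model of the stated behaviour, not the control plane's implementation: the input names are invented labels for the four inputs listed in the paragraph, and §7 and §8 are simply absent from the mapping because they are never pending.

```python
# Section -> required input. Input names are hypothetical labels for illustration.
SECTION_INPUTS = {
    "§2": "data",
    "§3": "misuse_residuals_art14",
    "§4": "control_results",
    "§9": "post_market_measures",
}


def pending_sections(available_inputs):
    """Return the Annex IV sections whose required input is missing."""
    return [s for s, needed in SECTION_INPUTS.items() if needed not in available_inputs]


# This scenario: data and control results are present.
scenario_pending = pending_sections({"data", "control_results"})
```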


Contrast with tutoriales/primer-sistema (loan)

| Dimension | loan (credit) | medical (medical imaging) |
| --- | --- | --- |
| Regulatory framework | EU AI Act + DORA (financial entity) | EU AI Act + MDR (SaMD) |
| Treatment modality | Code change (train.py V1→V2, fairlearn) | Human HOTL oversight (Art.14, declared organisational control) |
| Model | LogReg, trained in-pipeline | TotalSegmentator v2, BYO (no fine-tuning) |
| Primary metric | demographic_parity_diff (tabular) | Volumetric Dice (image) |
| Equity | disparate_impact (binary, by gender) | es_dice (ESSP, equity over continuous Dice) |
| Power-stats | Bootstrap per sample (n=1000) | Bootstrap per patient cluster (n≈33 effective) |
| Gate at T0 | RED: global control fails (fairness) | GREEN: all gate controls pass; the worst subgroup (audit) records the residual |
| Response to residual | Retrain (Drift B: model changes) | Declare HOTL oversight (organisational control, model and data unchanged) |

The engine is the same in both scenarios. The Reproducer seam dispatches dvc repro in both cases. What changes is the treatment modality and the regulatory framework — declared in the AssuranceProgram via frameworks and projected in the conformance reports without modifying the engine.


  1. An AssuranceProgram with dual regulation (EU AI Act + MDR) declared per measure via frameworks
  2. 11 ex-ante controls generated by sei compile from 4 risks and 14 measures (excluding 3 with lifecycle: [monitoring])
  3. A GREEN gate on a real GPU cohort: all gate controls pass (global Dice ≈0.90); the worst subgroup (audit) records the residual without blocking
  4. The treatment via human HOTL oversight (Art.14): a declared ex-ante organisational control, not a parameter or code change
  5. Honest UNDERPOWERED reporting: with an effective cohort of ≈33 the bootstrap CIs are wide; the worst subgroup is modelled as audit precisely because at small n it can be noise
  6. Dual conformance projected over the same bundle: prEN 18228 + ISO 23894

| Gap | Why it matters |
| --- | --- |
| HOTL declared ex-ante, post-market operation without live data | The human-oversight-worst-case-review control documents the plan; the clinical-review evidence is runtime (Phase 2) |
| NSD / HD95 as cross-check | The Dice size-bias caveat (arXiv 2509.19778) requires cross-checking with boundary metrics; future work |
| Post-market controls (postmarket-subgroup-drift) | Ex-ante only in v1; Art. 72 not modelled in the engine |
| Annex IV §3/§9 PENDING in this scenario | §2/§3/§4/§9 can be pending if their input is missing; here only §3 and §9 are (they depend on the deployer’s environment); §7 and §8 are never pending |

See State and incompleteness for the complete gap map.