The red/green loop (measure vs threshold)

🟡 Partial — The statistical power caveat (UNDERPOWERED) is advisory for the run gate in v1: it is flagged and does not stop the run, but the control is not counted as green — it stays INCONCLUSIVE.

A control associates a metric with a threshold and a blocking condition. The state of the control — red or green — is the state of the treatment cycle for that risk.

What defines a control

A control consists of three elements:

Metric: the measured value (for example the demographic parity difference or test coverage)
Threshold: the limit value declared in froga.yaml, derived from the risk appetite
Mode: blocking (a failure brings down the gate) or advisory (a failure is flagged but does not block)

The risk gate is the conjunction of all blocking controls. If any of them is red, froga run exits with a non-zero status, although the evidence is still anchored and signed.
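The conjunction can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the engine's code: the names State, Mode, ControlResult and gate_exit_code are hypothetical, and the choice of 1 as the non-zero exit value is an assumption. Following the note at the top of this page, an INCONCLUSIVE control is not green but does not bring the gate down.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    GREEN = "green"
    RED = "red"
    INCONCLUSIVE = "inconclusive"


class Mode(Enum):
    BLOCKING = "blocking"
    ADVISORY = "advisory"


@dataclass(frozen=True)
class ControlResult:
    # Hypothetical record of one evaluated control.
    name: str
    state: State
    mode: Mode


def gate_exit_code(results: list[ControlResult]) -> int:
    """Return 0 when no blocking control is red, 1 otherwise (assumed value)."""
    blocked = any(
        r.mode is Mode.BLOCKING and r.state is State.RED for r in results
    )
    return 1 if blocked else 0
```

A red advisory control leaves the exit code at 0 in this sketch; a single red blocking control is enough to make it non-zero.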

Real example: the loan and demographic parity

The loan scenario (consumer credit, EU AI Act Annex III §5(b)) uses the demographic parity difference (demographic_parity_diff, the difference in approval rates between gender groups) as its fairness control. The metric is measured on the held-out split (test, n=200), not in-sample (M-11). The gate passes on the point estimate when it is below 0.092, a declared bound that is not pinned to the observed value.
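For readers unfamiliar with the metric, here is a minimal sketch of how a demographic parity difference can be computed. It assumes binary approval decisions and one group label per applicant; the function name and the toy data are illustrative and are not taken from the froga engine or from the loan dataset.

```python
def demographic_parity_diff(approved, group):
    """Largest gap in approval rate between any two groups."""
    totals = {}
    positives = {}
    for decision, g in zip(approved, group):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + int(decision)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Made-up toy data: 3 of 4 approved in group "a", 1 of 4 in group "b".
approved = [1, 1, 0, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_diff(approved, group)  # 0.75 - 0.25
```

With two groups this is simply the absolute difference of the two approval rates; a value of 0 means both groups are approved at the same rate.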

In the demonstrated run:

measured metric:   demographic_parity_diff ≈ 0.046 (held-out, n=200)
declared threshold: 0.092
point estimate: passes (0.046 < 0.092)

However, the engine computes a 95% cluster-aware bootstrap confidence interval:

bootstrap CI [0.003, 0.174]

The upper bound of the CI, 0.174, lies above the threshold 0.092, so the interval straddles it. The point estimate passes the bound, but the sample (n=200) is too small to separate the estimate from the threshold: the engine marks the control INCONCLUSIVE and emits the UNDERPOWERED warning (M-1) instead of a green. This is why the loan arc is no longer “red→green” but “red→INCONCLUSIVE”, and the system verdict is GAP.
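The decision for the loan run can be reproduced with a small sketch. It assumes a lower-is-better metric and treats every interval that straddles the threshold as INCONCLUSIVE; the function name, the strictness of the inequalities, and the labelling of a fully failing interval as RED are assumptions, not documented engine behaviour.

```python
def classify(ci_low, ci_high, threshold):
    """Map a confidence interval to a control state (lower is better)."""
    if ci_high < threshold:
        return "GREEN"         # whole interval is below the bound
    if ci_low >= threshold:
        return "RED"           # whole interval is at or above the bound
    return "INCONCLUSIVE"      # interval straddles the bound: underpowered


point, ci_low, ci_high, threshold = 0.046, 0.003, 0.174, 0.092
point_passes = point < threshold              # True: 0.046 < 0.092
state = classify(ci_low, ci_high, threshold)  # "INCONCLUSIVE"
print(point_passes, state)
```

The point estimate alone would have produced a pass; only the interval shows that the evidence cannot yet separate the estimate from the bound.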

Statistical reliability: CI, not just point estimates

Reporting only the point estimate is not enough for high-risk systems: a point estimate on its own can hide a wide CI, in which case the system may “pass” the threshold by statistical chance.

The froga engine reports confidence intervals per control:

Cluster-aware bootstrap: when observations are not independent (e.g., multiple images per patient in the medical case), the bootstrap resamples by cluster (patient), so the CI respects intra-cluster dependence.
Crossed threshold: if the CI contains the threshold, the engine emits UNDERPOWERED together with the explicit CI, regardless of the point estimate.

This pattern applies to both the loan scenario (resampled by sample) and the medical scenario (resampled by patient).
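A cluster-aware bootstrap can be sketched with the standard library alone. The version below resamples whole clusters with replacement and takes a simple percentile interval; whether the froga engine uses the percentile method, this number of resamples, or this interface is not stated here, so cluster_bootstrap_ci, its parameters and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(rows, cluster_of, statistic,
                         n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI obtained by resampling whole clusters with replacement."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[cluster_of(row)].append(row)
    clusters = list(by_cluster.values())

    estimates = []
    for _ in range(n_boot):
        # Draw as many clusters as the data has; each cluster's rows stay together.
        drawn = rng.choices(clusters, k=len(clusters))
        resample = [row for cluster in drawn for row in cluster]
        estimates.append(statistic(resample))

    estimates.sort()
    low = estimates[int((alpha / 2) * (n_boot - 1))]
    high = estimates[int((1 - alpha / 2) * (n_boot - 1))]
    return low, high


# Made-up rows: (patient id, 1 if the prediction was correct else 0).
rows = [("p1", 1), ("p1", 1), ("p1", 0), ("p2", 0),
        ("p2", 0), ("p3", 1), ("p4", 1), ("p4", 0)]


def accuracy(sample):
    return sum(value for _, value in sample) / len(sample)


low, high = cluster_bootstrap_ci(rows, cluster_of=lambda r: r[0],
                                 statistic=accuracy)
```

When every row is its own cluster, as in a by-sample setting, the same function reduces to an ordinary bootstrap. A statistic that compares groups, such as a parity difference, would also need to handle resamples in which a group is missing.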

The “refactor” is the treatment

When a control is red, the next step is to choose the lowest-cost treatment that turns the control green. The available options (code change, parameter adjustment, dataset change) are described in Treatment modalities.

Committing the treatment to git is the act that gets the control re-measured on the next froga run. When the held-out evidence is sufficient, the control turns green; when it is underpowered (as in loan, held-out n=200), the control turns INCONCLUSIVE instead, and the cycle does not close in green.