GaiaLab Calibration Methodology — Brier & ECE Published Openly

1

What GaiaLab measures

Three calibration metrics, each with a distinct gold standard and a distinct interpretation. They are not interchangeable — read the caveats before citing.

0.186

Brier Score

vs 0.25 no-skill baseline · lower is better

Mean squared error of confidence scores vs real-world trial outcome labels (N=498 resolved predictions). Read alongside ECE — a low Brier with high ECE reflects class imbalance, not proven calibration.

0.545

AUROC Retrospective

95% CI [0.526, 0.562] · random = 0.50

Ranks candidates above non-candidates across 22 disease areas (March 2026 snapshot, N=529). Gold standard: ClinicalTrials.gov completed trial match.

Modest signal above random. CI is narrow and above 0.50, but limited by high positive-label rates in high-activity disease areas.

Pending

AUROC Prospective

pre-registered lockbox · readout from ~Sep 2026

A reliable prospective AUROC is not yet available. An approval-year proxy scores 0.90, but that number is not trustworthy: a 76% positive base rate inflates it, and its own bootstrap 95% CI [0.526, 0.662] does not even contain 0.90. The genuine test is the pre-registered, sha256-sealed prospective lockbox (frozen 2026-06-28), scored after trial outcomes accrue.

Do not cite 0.90 as prospective accuracy — it is a base-rate artifact. See /validation for the sealed protocol and readout schedule.
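
As a rough illustration of what a sha256 seal provides, the sketch below hashes a frozen set of predictions and later checks that the set has not changed. The record fields (id, score, timestamp), the canonical JSON form, and both function names are assumptions made for this example; the actual sealed protocol is the one described at /validation.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical record shape: { id, score, timestamp }. Not GaiaLab's actual schema.
function sealPredictions(predictions) {
  // Sort by id and fix the key order so the same set always serializes identically.
  const canonical = JSON.stringify(
    [...predictions]
      .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
      .map(({ id, score, timestamp }) => ({ id, score, timestamp }))
  );
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// True only if the predictions serialize to exactly what was sealed.
function verifySeal(predictions, publishedDigest) {
  return sealPredictions(predictions) === publishedDigest;
}

const frozen = [
  { id: "pred-002", score: 64, timestamp: "2026-06-20T00:00:00Z" },
  { id: "pred-001", score: 81, timestamp: "2026-06-19T00:00:00Z" },
];
const digest = sealPredictions(frozen);
console.log(verifySeal(frozen, digest)); // true
console.log(verifySeal([{ ...frozen[0], score: 90 }, frozen[1]], digest)); // false
```

Publishing the digest before outcomes accrue lets a reader confirm later that no prediction was edited after the freeze date.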

Temporal classification: Retrospective = trial existed before the prediction timestamp. Prospective = prediction precedes trial registration. Only prospective matches constitute forward predictions. The Brier score is agnostic to temporal direction. See /new-trials for the live classified ledger.
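
The temporal rule above can be sketched as a single comparison. The function name and argument names are hypothetical, and treating identical timestamps as retrospective is an assumption (the conservative choice), not something the text specifies.

```javascript
// Hypothetical helper: classify a prediction-trial match by temporal direction.
function classifyTemporal(predictionTimestamp, trialRegistrationDate) {
  const predictedAt = new Date(predictionTimestamp).getTime();
  const registeredAt = new Date(trialRegistrationDate).getTime();
  if (Number.isNaN(predictedAt) || Number.isNaN(registeredAt)) {
    throw new Error("Both arguments must be valid dates");
  }
  // Prospective only when the prediction strictly precedes trial registration.
  return predictedAt < registeredAt ? "prospective" : "retrospective";
}

console.log(classifyTemporal("2025-03-01T00:00:00Z", "2025-06-15T00:00:00Z")); // prospective
console.log(classifyTemporal("2025-03-01T00:00:00Z", "2019-11-02T00:00:00Z")); // retrospective
```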

2

Brier score — exact methodology

Computed in src/utils/prediction-tracker.js → getCalibration(). The formula and inclusion criteria are fixed — not tuned post-hoc.

// Brier Score (lower = better; 0 = perfect; 0.25 = no-skill uniform predictor)
BS = (1 / N) × Σᵢ ( predicted_probabilityᵢ − outcomeᵢ )²

// No-skill baseline: p = 0.5 for every prediction regardless of evidence
BS_random ≈ 0.25 (uniform 50% predictor, binary outcomes)
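
A minimal sketch of the formula above, for readers who want to check the arithmetic. The function name and the { p, y } record shape are assumptions for illustration; the production logic lives in getCalibration() and may differ in detail.

```javascript
// Hypothetical helper. Each record: { p: predicted probability in [0, 1],
// y: outcome label in [0, 1] }.
function brierScore(records) {
  if (records.length === 0) {
    throw new Error("Brier score is undefined for an empty set");
  }
  const sumSquaredError = records.reduce(
    (sum, { p, y }) => sum + (p - y) ** 2,
    0
  );
  return sumSquaredError / records.length;
}

// No-skill baseline: predict 0.5 for every binary outcome.
const binaryOutcomes = [1, 0, 1, 1, 0];
const baseline = brierScore(binaryOutcomes.map((y) => ({ p: 0.5, y })));
console.log(baseline); // 0.25
```

Every squared error in the baseline case is 0.5² = 0.25, which is why the no-skill value does not depend on the mix of binary outcomes.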

Inclusion criteria (retrospective AUROC benchmark, N=529)

Resolved outcome required. Included only when outcomeVerdict = 'validated' (completed CT.gov trial found) or outcomeVerdict = 'insufficient_data' (checked, no trial found). Predictions with trial_active status are excluded — outcome unknown.
Predicted probability = normalized confidence score. Each analysis assigns a repurposing score (0–100) via calculateRepurposingScore() in src/ai/models/drug-repurposing-engine.js; the score is then normalized and clamped to [0, 1]. Scores reflect six weighted factors: target overlap, clinical evidence, MOA alignment, disease context, pathway enrichment, safety profile.
Soft outcome labels. Validated predictions receive a graded label based on the trial's extracted result class (see table below), reducing sensitivity to false positives from trials with mixed or neutral results.
Date range: Predictions span from GaiaLab's earliest recorded analysis (early 2025) through the current date. The N grows weekly as npm run predictions:check runs ClinicalTrials.gov lookups. Live count always at /api/predictions/calibration.
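
The first two criteria can be sketched as a filter followed by a normalization step. The outcomeVerdict values come from the text; the `score` property name, the division by 100, and the function name are assumptions for illustration only.

```javascript
// Verdicts that count as resolved, per the inclusion criteria above.
const RESOLVED_VERDICTS = new Set(["validated", "insufficient_data"]);

// Hypothetical helper: keep resolved predictions and attach a probability.
function selectResolved(predictions) {
  return predictions
    .filter((pred) => RESOLVED_VERDICTS.has(pred.outcomeVerdict))
    .map((pred) => ({
      ...pred,
      // Assumed normalization: 0-100 score divided by 100, clamped to [0, 1].
      probability: Math.min(1, Math.max(0, pred.score / 100)),
    }));
}

const sample = [
  { id: "a", score: 82, outcomeVerdict: "validated" },
  { id: "b", score: 45, outcomeVerdict: "trial_active" }, // outcome unknown
  { id: "c", score: 130, outcomeVerdict: "insufficient_data" }, // out of range
];
const resolved = selectResolved(sample);
for (const r of resolved) console.log(r.id, r.probability);
// a 0.82
// c 1
```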

Outcome label encoding

outcomeVerdict	completedOutcomeClass	Label yᵢ	Meaning
`validated`	`positive`	1.0	Trial completed; positive result for this drug–disease direction
`validated`	`mixed`	0.5	Trial completed; mixed or inconclusive results
`validated`	`neutral`	0.25	Trial completed; no detectable signal either direction
`validated`	`negative`	0.0	Trial completed; negative result
`insufficient_data`	—	0.0	Checked; no completed trial found (treated as negative)
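
The encoding table can be expressed as a small lookup. The label values are taken directly from the table; the function name and the choice to return null for unresolved predictions are assumptions for this sketch.

```javascript
// Soft labels for validated predictions, keyed by completedOutcomeClass.
const OUTCOME_CLASS_LABELS = new Map([
  ["positive", 1.0],
  ["mixed", 0.5],
  ["neutral", 0.25],
  ["negative", 0.0],
]);

// Hypothetical helper: returns the label y, or null when the prediction
// is unresolved and should be excluded from scoring.
function outcomeLabel(outcomeVerdict, completedOutcomeClass) {
  if (outcomeVerdict === "insufficient_data") return 0.0;
  if (outcomeVerdict === "validated") {
    if (!OUTCOME_CLASS_LABELS.has(completedOutcomeClass)) {
      throw new Error(`Unknown outcome class: ${completedOutcomeClass}`);
    }
    return OUTCOME_CLASS_LABELS.get(completedOutcomeClass);
  }
  return null;
}

console.log(outcomeLabel("validated", "mixed")); // 0.5
console.log(outcomeLabel("insufficient_data")); // 0
```

One consequence worth keeping in mind: a constant 0.5 prediction has zero squared error against a mixed label and 0.0625 against a neutral one, so the 0.25 no-skill figure applies strictly to binary labels.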

Why soft labels? Collapsing to binary (trial found / not found) would conflate a drug with overwhelming Phase III evidence and one with a terminated trial. Soft labels encode outcome quality and align the Brier score with clinical utility, not just concordance.

Live calibration JSON ↗

3

Reliability diagram — reading the chart

Generated live from /api/predictions/calibration. Shows the observed trial completion rate for predictions grouped by confidence decile. A perfectly calibrated model tracks the dashed diagonal — every bar at the same height as the diagonal line.


Legend: ■ ≤5% off perfect calibration · ■ Overconfident · ■ Underconfident · dashed line: perfect calibration diagonal
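
A sketch of how such a chart could be derived from resolved predictions. The 5-point tolerance mirrors the legend; the function name, the { p, y } record shape, and comparing each bin's mean confidence (rather than its midpoint) to the observed rate are assumptions, and the live endpoint may compute buckets differently.

```javascript
// Hypothetical helper: group records into confidence deciles and flag each bin.
function reliabilityCurve(records, tolerance = 0.05) {
  const bins = Array.from({ length: 10 }, () => ({ count: 0, sumP: 0, sumY: 0 }));
  for (const { p, y } of records) {
    const index = Math.min(9, Math.floor(p * 10)); // p = 1.0 falls in the top bin
    bins[index].count += 1;
    bins[index].sumP += p;
    bins[index].sumY += y;
  }
  return bins.map((bin, decile) => {
    if (bin.count === 0) return { decile, count: 0, status: "empty" };
    const avgConfidence = bin.sumP / bin.count;
    const observedRate = bin.sumY / bin.count;
    const gap = avgConfidence - observedRate;
    let status = "calibrated";
    if (gap > tolerance) status = "overconfident";
    else if (gap < -tolerance) status = "underconfident";
    return { decile, count: bin.count, avgConfidence, observedRate, status };
  });
}

const demo = [
  ...[1, 1, 1, 0].map((y) => ({ p: 0.55, y })), // observed 0.75 vs predicted 0.55
  ...[1, 1, 1, 0].map((y) => ({ p: 0.75, y })), // observed 0.75 vs predicted 0.75
  ...[1, 0].map((y) => ({ p: 0.95, y })), // observed 0.50 vs predicted 0.95
];
const curve = reliabilityCurve(demo);
console.log(curve[5].status, curve[7].status, curve[9].status);
// underconfident calibrated overconfident
```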

What GaiaLab's chart shows: Well-calibrated at high confidence (70%+) — those buckets track close to the diagonal. Mid-range predictions (40–70%) are systematically underconfident — the model assigns lower probability than observed trial completion rates suggest. This is a known artifact of conservative scoring in uncertain disease contexts and is being recalibrated using Platt scaling on the retrospective dataset.
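
For readers unfamiliar with Platt scaling, the sketch below fits p = sigmoid(a · score + b) to binary outcomes by plain gradient descent on log loss. It illustrates the general technique only: the function names, the { score, outcome } record shape, the learning rate, the iteration count, and the toy data are all assumptions, and nothing here is taken from GaiaLab's recalibration code.

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Mean negative log-likelihood of binary outcomes under sigmoid(a * score + b).
function logLoss(records, a, b) {
  const eps = 1e-12; // keeps Math.log away from 0
  let total = 0;
  for (const { score, outcome } of records) {
    const p = Math.min(1 - eps, Math.max(eps, sigmoid(a * score + b)));
    total -= outcome * Math.log(p) + (1 - outcome) * Math.log(1 - p);
  }
  return total / records.length;
}

// Hypothetical fit routine: batch gradient descent on the log loss.
function fitPlatt(records, { learningRate = 0.5, iterations = 5000 } = {}) {
  let a = 1;
  let b = 0;
  for (let step = 0; step < iterations; step++) {
    let gradA = 0;
    let gradB = 0;
    for (const { score, outcome } of records) {
      const error = sigmoid(a * score + b) - outcome;
      gradA += error * score;
      gradB += error;
    }
    a -= (learningRate * gradA) / records.length;
    b -= (learningRate * gradB) / records.length;
  }
  return { a, b, calibrate: (score) => sigmoid(a * score + b) };
}

// Toy data: raw scores in [0, 1] with binary outcomes (illustrative only).
const toy = [0.2, 0.3, 0.45, 0.5, 0.55, 0.6, 0.65, 0.8, 0.9].map((score, i) => ({
  score,
  outcome: [0, 0, 1, 1, 0, 1, 1, 1, 1][i],
}));
const platt = fitPlatt(toy);
console.log(logLoss(toy, 1, 0) > logLoss(toy, platt.a, platt.b)); // true
```

The fitted `calibrate` function would then be applied to raw scores before they are binned for the reliability diagram.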

Underconfidence vs overconfidence: An underconfident model is safer for research use — it under-promises. Researchers who act on a 50% confidence score and find a positive signal are not misled; they are pleasantly surprised.

4

ECE — Expected Calibration Error

ECE is the weighted average absolute deviation of predicted confidence from observed rates across decile bins. Where Brier is a mean squared error over individual predictions, ECE is a bin-weighted mean absolute gap in probability space, which makes it easier to read as "how far off, on average."

ECE = Σ_b ( |B_b| / N ) × | avg_confidence_b − observed_rate_b |
// B_b = set of predictions in decile bin b; N = total predictions with outcomes
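
A minimal sketch of the ECE formula above, using ten equal-width bins. The function name and the { p, y } record shape are assumptions for illustration; the live figure comes from the calibration endpoint and may handle bin edges differently.

```javascript
// Hypothetical helper: bin-weighted mean absolute gap between confidence and outcome.
function expectedCalibrationError(records, numBins = 10) {
  if (records.length === 0) {
    throw new Error("ECE is undefined for an empty set");
  }
  const bins = Array.from({ length: numBins }, () => ({ count: 0, sumP: 0, sumY: 0 }));
  for (const { p, y } of records) {
    const index = Math.min(numBins - 1, Math.floor(p * numBins));
    bins[index].count += 1;
    bins[index].sumP += p;
    bins[index].sumY += y;
  }
  let ece = 0;
  for (const bin of bins) {
    if (bin.count === 0) continue; // empty bins carry zero weight
    const avgConfidence = bin.sumP / bin.count;
    const observedRate = bin.sumY / bin.count;
    ece += (bin.count / records.length) * Math.abs(avgConfidence - observedRate);
  }
  return ece;
}

// Two bins of four predictions each: one off by 0.20, one exactly calibrated.
const eceDemo = [
  ...[1, 1, 1, 0].map((y) => ({ p: 0.55, y })),
  ...[1, 1, 1, 0].map((y) => ({ p: 0.75, y })),
];
console.log(expectedCalibrationError(eceDemo).toFixed(2)); // 0.10
```

In the demo, each bin holds half the predictions, so ECE = 0.5 × 0.20 + 0.5 × 0 = 0.10.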

25.9%

ECE (all predictions)

On average 25.9 percentage points from perfect calibration across all deciles. Driven primarily by underconfident mid-range (40–70%) buckets.

55.5%

ECE Prospective

Computed only on prospective matches. Higher ECE is structurally expected — prospective outcomes may not be observable for months or years after prediction.

Why ECE prospective is high: A prediction made today is measured against a trial registration that may not occur for 6–24 months, so the prospective set is small and its labels are sparse. As that set grows toward hundreds of resolved matches, ECE prospective is expected to move toward the retrospective figure. This is not a flaw; it is honest reporting of the current state.

5

AUROC — per-disease breakdown

AUROC computed via Mann-Whitney U statistic in scripts/benchmark-auroc.js. Answers: "given a random positive–negative pair, how often does the model rank the positive higher?" 0.50 = random; 1.0 = perfect discrimination.
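
The rank-based computation can be sketched as follows. This is a generic Mann-Whitney U implementation with average ranks for ties, written for illustration; the function name and argument layout are assumptions and it is not a copy of scripts/benchmark-auroc.js.

```javascript
// Hypothetical helper: AUROC = U / (nPos * nNeg), with U from the rank sum of positives.
function aurocMannWhitney(scores, labels) {
  const order = scores.map((_, i) => i).sort((a, b) => scores[a] - scores[b]);
  const ranks = new Array(scores.length);
  let start = 0;
  while (start < order.length) {
    // Find the end of the current group of tied scores.
    let end = start;
    while (end + 1 < order.length && scores[order[end + 1]] === scores[order[start]]) {
      end += 1;
    }
    const averageRank = (start + end) / 2 + 1; // ranks are 1-based
    for (let k = start; k <= end; k++) ranks[order[k]] = averageRank;
    start = end + 1;
  }
  let positives = 0;
  let positiveRankSum = 0;
  labels.forEach((label, i) => {
    if (label === 1) {
      positives += 1;
      positiveRankSum += ranks[i];
    }
  });
  const negatives = labels.length - positives;
  if (positives === 0 || negatives === 0) {
    throw new Error("AUROC needs at least one positive and one negative");
  }
  const u = positiveRankSum - (positives * (positives + 1)) / 2;
  return u / (positives * negatives);
}

console.log(aurocMannWhitney([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])); // 0.75
```

In the example, three of the four positive–negative pairs are ranked correctly, giving 3/4 = 0.75.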

Label definition: Positive = outcomeVerdict = validated (completed CT.gov trial found). Negative = outcomeVerdict = insufficient_data (checked, no trial found). Active and unchecked predictions excluded. Bootstrap CI over 1,000 resamples.

Disease	N	+ rate	AUROC	AUPRC	P@10
(Per-disease rows are populated live on the page.)

Base rate caveat: Disease areas with positive rates above 90% (e.g., glioblastoma 99.6%) have limited AUROC discriminability: with so few negatives, the AUROC estimate rests on very few positive–negative pairs, and a trivial classifier already scores near the positive rate on AUPRC and P@10. The March 2026 retrospective snapshot is marked stale in data/eval/auroc-retrospective.json due to an April 2026 bulk backfill; recomputing would not produce a comparable metric. The March 2026 snapshot remains the citable baseline.

6

What this means for researchers

Reading the Brier score (live value above) — and its ECE caveat:

· A Brier below the 0.25 no-skill baseline beats random, but when ECE is high the low Brier is driven by class imbalance in the resolved set, not by calibrated confidence. Always read the two together.
· Predictions with 70%+ confidence match completed trials at meaningful rates — actionable for hypothesis prioritization.
· Mid-range predictions (40–70%) are underconfident — actual trial completion rates exceed what the model predicts. Treat as broader candidate lists, not tightly ranked signals.
· Low-confidence scores (below 40%) are exploratory only — biological plausibility is flagged, not clinical precedent.
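
The three confidence bands above can be sketched as a simple triage function. The thresholds (70 and 40) come from the list; the function name, tier names, and the decision to place exactly 70 in the high tier and exactly 40 in the mid tier are assumptions.

```javascript
// Hypothetical helper: map a confidence percentage (0-100) to a reading guide.
function triageTier(confidencePercent) {
  if (confidencePercent >= 70) {
    return { tier: "high", use: "hypothesis prioritization" };
  }
  if (confidencePercent >= 40) {
    return { tier: "mid", use: "broader candidate list, not a tight ranking" };
  }
  return { tier: "low", use: "exploratory only" };
}

console.log(triageTier(82).tier); // high
console.log(triageTier(55).tier); // mid
console.log(triageTier(12).tier); // low
```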

This is not clinical validation. A completed CT.gov trial means investigators chose to study this drug–disease pair. It does not imply the drug works, is approved, or is safe for any specific indication. Completed ≠ proven efficacy.
High-confidence GaiaLab scores are computational triage signals. Evidence-grounded starting points for literature review, experimental design, and grant framing — not endpoints. The appropriate follow-up is systematic literature review, not a clinical decision.
Retrospective concordance is the honest baseline. A system that independently ranks drugs already under clinical investigation is demonstrating biological plausibility, not predicting unknown outcomes. The Brier score and retrospective AUROC both use this standard.
Prospective concordance is what matters scientifically. GaiaLab timestamps every prediction. When a new trial registers after a prediction, that is a prospective match — the only fair test of forward predictive ability. See /new-trials for the live classified ledger and the pipeline of 330+ unmatched predictions entering their 30–90 day window.

7

Raw data access & reproducibility

All calibration data is publicly accessible. These endpoints return the exact numbers shown on this page. No authentication required.

GET /api/predictions/calibration · Full JSON: Brier, ECE, reliability curve by decile

GET /api/new-trials · All concordance matches with timestamps and outcomes

GET /api/stats/public · Summary statistics (counts, recruiting, Brier headline)
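
A minimal sketch of reading the calibration endpoint from Node 18+ (which ships a global fetch). The function name and the injectable `fetchImpl` parameter are assumptions for illustration, and the base URL is left to the caller; the shape of the returned JSON is not documented here, so the sketch returns it untouched rather than guessing at field names.

```javascript
// Hypothetical helper: fetch the public calibration JSON from a given base URL.
async function fetchCalibration(baseUrl, fetchImpl = fetch) {
  const response = await fetchImpl(`${baseUrl}/api/predictions/calibration`);
  if (!response.ok) {
    throw new Error(`Calibration request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (not executed here; the host is assumed from the citation line):
// fetchCalibration("https://gailabai.com").then((data) => console.log(data));
```

Passing the fetch implementation as a parameter keeps the helper testable without network access.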

Benchmark scripts (reproducible)

SCRIPT scripts/benchmark-auroc.js · AUROC/AUPRC; internal CT.gov or OpenTargets gold standard

SCRIPT scripts/gaialab-eval.js · Full evaluation: NDCG@10, paired t-test, grounding rate

FILE data/eval/auroc-retrospective.json · March 2026 snapshot, citable baseline

FILE data/eval/auroc-prospective.json · Prospective AUROC (approval-year temporal holdout)

Peer review status: Not yet externally peer-reviewed. Methodology is preprint-ready. To reproduce these numbers or review this work, contact partnerships@gailabai.com.

Citation format (draft): Idiakhoa O. (2026). GaiaLab: evidence-grounded computational drug triage with published calibration. Brier Score and ECE reported live (see current values above), AUROC 0.545 (22 disease areas). gailabai.com/calibration. (Cite the Brier/ECE/N shown live at access time — the resolved set grows weekly.)

