Calibration Methodology · Openly Published

How GaiaLab measures prediction accuracy

Every number on this page is verifiable. The raw data, benchmark scripts, and calibration API are publicly accessible. Raw data: /api/predictions/calibration

📅 Metrics refreshed live 📊 N=498 resolved predictions 🔬 22 disease areas
1
What GaiaLab measures

Three calibration metrics, each with a distinct gold standard and a distinct interpretation. They are not interchangeable — read the caveats before citing.

0.186
Brier Score
vs 0.25 no-skill baseline · lower is better
Mean squared error of confidence scores vs real-world trial outcome labels (N=498 resolved predictions). Read alongside ECE — a low Brier with high ECE reflects class imbalance, not proven calibration.
0.545
AUROC Retrospective
95% CI [0.526, 0.562] · random = 0.50
Ranks candidates above non-candidates across 22 disease areas (March 2026 snapshot, N=529). Gold standard: ClinicalTrials.gov completed trial match.
Modest signal above random. CI is narrow and above 0.50, but limited by high positive-label rates in high-activity disease areas.
0.90
AUROC Prospective
approval-year holdout · 4,446 prediction–label pairs
Drug scoring against CT.gov trial match labels using approval year as a temporal cutoff. 22 disease areas; 76% positive base rate.
Bootstrap 95% CI [0.526, 0.662] does not contain the point estimate — reflects instability at very high positive base rates. Interpret cautiously.
Temporal classification: Retrospective = trial existed before the prediction timestamp. Prospective = prediction precedes trial registration. Only prospective matches constitute forward predictions. The Brier score is agnostic to temporal direction. See /new-trials for the live classified ledger.
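The retrospective/prospective split comes down to one timestamp comparison. The sketch below is illustrative only: the function name and the `predictedAt` / `trialRegisteredAt` inputs are hypothetical, and treating an exact tie as retrospective is an assumption (the conservative choice), not something the page specifies.

```javascript
// Hypothetical helper: inputs are ISO 8601 date strings.
// "prospective"   = the prediction precedes trial registration.
// "retrospective" = the trial already existed at prediction time.
function classifyTemporalDirection(predictedAt, trialRegisteredAt) {
  if (!trialRegisteredAt) return "unmatched"; // no trial registered yet

  const predicted = Date.parse(predictedAt);
  const registered = Date.parse(trialRegisteredAt);
  if (Number.isNaN(predicted) || Number.isNaN(registered)) {
    throw new Error("Invalid date input");
  }

  // Assumption: equal timestamps count as retrospective.
  return registered > predicted ? "prospective" : "retrospective";
}

console.log(classifyTemporalDirection("2025-03-01", "2025-06-01"));
console.log(classifyTemporalDirection("2025-06-01", "2025-03-01"));
```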
2
Brier score — exact methodology

Computed in src/utils/prediction-tracker.js → getCalibration(). The formula and inclusion criteria are fixed — not tuned post-hoc.

// Brier Score (lower = better; 0 = perfect; 0.25 = no-skill uniform predictor)
BS = (1 / N) × Σᵢ ( predicted_probabilityᵢ − outcomeᵢ )²

// No-skill baseline: p = 0.5 for every prediction regardless of evidence
BS_random ≈ 0.25   (uniform 50% predictor, binary outcomes)
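
The formula above can be sketched in a few lines. This is a minimal illustration, not the code in getCalibration(): the function names and the `{ p, y }` record shape are assumptions. It also makes one property visible: with strictly binary outcomes the p = 0.5 baseline is exactly 0.25, while graded labels such as 0.5 pull the baseline lower.

```javascript
// Hypothetical helpers, not the project's getCalibration().
// Each record: { p: predicted probability in [0, 1], y: outcome label in [0, 1] }.
function brierScore(records) {
  if (records.length === 0) return null; // undefined for an empty set
  const sumSq = records.reduce((acc, r) => acc + (r.p - r.y) ** 2, 0);
  return sumSq / records.length;
}

// No-skill reference: the same outcomes scored with p = 0.5 everywhere.
function noSkillBrier(records) {
  return brierScore(records.map((r) => ({ p: 0.5, y: r.y })));
}

// Made-up sample data for illustration only.
const sample = [
  { p: 0.9, y: 1.0 },
  { p: 0.6, y: 0.5 },
  { p: 0.2, y: 0.0 },
  { p: 0.7, y: 0.0 },
];

console.log("Brier:", brierScore(sample).toFixed(4));
console.log("No-skill:", noSkillBrier(sample).toFixed(4));
```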

Inclusion criteria (retrospective AUROC benchmark, N=529)

  • Resolved outcome required. Included only when outcomeVerdict = 'validated' (completed CT.gov trial found) or outcomeVerdict = 'insufficient_data' (checked, no trial found). Predictions with trial_active status are excluded — outcome unknown.

  • Predicted probability = normalized confidence score. Each analysis assigns a repurposing score (0–100) via calculateRepurposingScore() in src/ai/models/drug-repurposing-engine.js; that score is rescaled and clamped to [0, 1]. Scores reflect 6 weighted factors: target overlap, clinical evidence, MOA alignment, disease context, pathway enrichment, safety profile.

  • Soft outcome labels. Validated predictions receive a graded label based on the trial's extracted result class (see table below), reducing sensitivity to false positives from trials with mixed or neutral results.

  • Date range: Predictions span from GaiaLab's earliest recorded analysis (early 2025) through the current date. The N grows weekly as npm run predictions:check runs ClinicalTrials.gov lookups. Live count always at /api/predictions/calibration.

Outcome label encoding

outcomeVerdict    | completedOutcomeClass | Label yᵢ | Meaning
validated         | positive              | 1.0      | Trial completed; positive result for this drug–disease direction
validated         | mixed                 | 0.5      | Trial completed; mixed or inconclusive results
validated         | neutral               | 0.25     | Trial completed; no detectable signal either direction
validated         | negative              | 0.0      | Trial completed; negative result
insufficient_data | (none)                | 0.0      | Checked; no completed trial found (treated as negative)
Why soft labels? Collapsing to binary (trial found / not found) would conflate a drug with overwhelming Phase III evidence and one with a terminated trial. Soft labels encode outcome quality and align the Brier score with clinical utility, not just concordance.
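
Taken together, the inclusion criteria and the label table map each resolved prediction to a (probability, label) pair. The sketch below is one way to express that mapping. The `score` field, the function names, and dividing by 100 before clamping are assumptions; `outcomeVerdict` and `completedOutcomeClass` are the column names from the table above.

```javascript
// Soft labels copied from the outcome label table.
const SOFT_LABELS = new Map([
  ["positive", 1.0],
  ["mixed", 0.5],
  ["neutral", 0.25],
  ["negative", 0.0],
]);

// Assumption: the 0-100 repurposing score is divided by 100, then clamped.
function toProbability(score) {
  return Math.min(1, Math.max(0, score / 100));
}

// Returns { p, y } for a resolved prediction, or null when it is excluded.
function toCalibrationRecord(pred) {
  if (pred.outcomeVerdict === "insufficient_data") {
    return { p: toProbability(pred.score), y: 0.0 }; // treated as negative
  }
  if (pred.outcomeVerdict === "validated") {
    const y = SOFT_LABELS.get(pred.completedOutcomeClass);
    if (y === undefined) return null; // unknown result class: skipped in this sketch
    return { p: toProbability(pred.score), y };
  }
  return null; // e.g. trial_active: outcome unknown, so excluded
}

console.log(
  toCalibrationRecord({ score: 80, outcomeVerdict: "validated", completedOutcomeClass: "mixed" })
);
```

How the real pipeline handles a validated prediction with a missing result class is not stated on this page; returning null here is purely a placeholder choice.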
3
Reliability diagram — reading the chart

Generated live from /api/predictions/calibration. Shows the observed trial completion rate for predictions grouped by confidence decile. A perfectly calibrated model tracks the dashed diagonal — every bar at the same height as the diagonal line.

[Reliability diagram: observed rate by confidence decile. Legend: within 5% of perfect calibration; overconfident; underconfident; dashed line = perfect calibration diagonal.]
What GaiaLab's chart shows: Well-calibrated at high confidence (70%+) — those buckets track close to the diagonal. Mid-range predictions (40–70%) are systematically underconfident — the model assigns lower probability than observed trial completion rates suggest. This is a known artifact of conservative scoring in uncertain disease contexts and is being recalibrated using Platt scaling on the retrospective dataset.

Underconfidence vs overconfidence: An underconfident model is safer for research use — it under-promises. Researchers who act on a 50% confidence score and find a positive signal are not misled; they are pleasantly surprised.
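
The three chart categories reduce to the sign and size of the gap between a bin's average confidence and its observed rate. A minimal sketch, assuming the 5% threshold in the legend means 5 percentage points in absolute terms; the function name is hypothetical.

```javascript
// Hypothetical helper: classifies one reliability-diagram bin.
// avgConfidence and observedRate are both fractions in [0, 1].
function classifyBin(avgConfidence, observedRate, tolerance = 0.05) {
  const gap = avgConfidence - observedRate;
  if (Math.abs(gap) <= tolerance) return "calibrated";
  // Confidence above the observed rate over-promises; below it under-promises.
  return gap > 0 ? "overconfident" : "underconfident";
}

console.log(classifyBin(0.55, 0.8)); // mid-range bin whose observed rate exceeds confidence
console.log(classifyBin(0.75, 0.77));
```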
4
ECE — Expected Calibration Error

ECE is the weighted average absolute deviation of predicted confidence from observed rates across decile bins. Where Brier is mean squared error, ECE is mean absolute error in probability space — more interpretable as "how far off, on average."

ECE = Σ_b ( |B_b| / N ) × | avg_confidence_b − observed_rate_b |
// B_b = set of predictions in decile bin b; N = total predictions with outcomes
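
As a rough illustration of the formula, the sketch below bins predictions and accumulates the weighted gaps. It assumes ten equal-width bins on [0, 1] (the page says "decile bins" without specifying equal-width versus quantile bins) and the hypothetical `{ p, y }` record shape.

```javascript
// Hypothetical helper: records are { p: confidence in [0, 1], y: outcome label in [0, 1] }.
function expectedCalibrationError(records, numBins = 10) {
  if (records.length === 0) return null; // undefined for an empty set

  const bins = Array.from({ length: numBins }, () => ({ n: 0, sumP: 0, sumY: 0 }));
  for (const { p, y } of records) {
    // p = 1.0 is placed in the last bin rather than an out-of-range one.
    const idx = Math.min(numBins - 1, Math.floor(p * numBins));
    bins[idx].n += 1;
    bins[idx].sumP += p;
    bins[idx].sumY += y;
  }

  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue; // empty bins contribute nothing
    const avgConfidence = b.sumP / b.n;
    const observedRate = b.sumY / b.n;
    ece += (b.n / records.length) * Math.abs(avgConfidence - observedRate);
  }
  return ece;
}

// Made-up data: two bins, one 10 points off and one 50 points off.
const eceSample = [
  { p: 0.9, y: 1 },
  { p: 0.9, y: 1 },
  { p: 0.5, y: 1 },
  { p: 0.5, y: 1 },
];
console.log(expectedCalibrationError(eceSample).toFixed(2));
```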
25.9%
ECE (all predictions)
On average 25.9 percentage points from perfect calibration across all deciles. Driven primarily by underconfident mid-range (40–70%) buckets.
55.5%
ECE Prospective
Computed only on prospective matches. Higher ECE is structurally expected — prospective outcomes may not be observable for months or years after prediction.
Why ECE prospective is high: A prediction made today is measured against a trial registration that may not occur for 6–24 months, so the prospective ECE is computed over a small set with sparse labels. As the prospective dataset grows toward hundreds of resolved matches, ECE prospective is expected to converge toward the retrospective figure. This is not a flaw; it is honest reporting of the current state.
5
AUROC — per-disease breakdown

AUROC computed via Mann-Whitney U statistic in scripts/benchmark-auroc.js. Answers: "given a random positive–negative pair, how often does the model rank the positive higher?" 0.50 = random; 1.0 = perfect discrimination.

Label definition: Positive: outcomeVerdict = 'validated' (completed CT.gov trial found). Negative: outcomeVerdict = 'insufficient_data' (checked, no trial found). Active and unchecked predictions are excluded. Bootstrap CI uses 1,000 resamples.
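
The pairwise reading of AUROC translates directly into code: count how often a positive outscores a negative, with ties worth half. The sketch below is not scripts/benchmark-auroc.js. The function names are hypothetical, and the bootstrap details (resampling positives and negatives separately, percentile interval) are assumptions, since the page states only the resample count.

```javascript
// AUROC as the normalized Mann-Whitney U statistic: the fraction of
// positive-negative pairs in which the positive is ranked higher.
function auroc(positives, negatives) {
  if (positives.length === 0 || negatives.length === 0) return null;
  let wins = 0;
  for (const p of positives) {
    for (const n of negatives) {
      if (p > n) wins += 1;
      else if (p === n) wins += 0.5; // ties count as half a win
    }
  }
  return wins / (positives.length * negatives.length);
}

// Assumption: stratified resampling with a percentile interval.
// rng is injectable so results can be made reproducible.
function bootstrapCI(positives, negatives, resamples = 1000, rng = Math.random) {
  const draw = (arr) => arr.map(() => arr[Math.floor(rng() * arr.length)]);
  const stats = [];
  for (let i = 0; i < resamples; i++) {
    stats.push(auroc(draw(positives), draw(negatives)));
  }
  stats.sort((a, b) => a - b);
  const lower = stats[Math.floor(0.025 * resamples)];
  const upper = stats[Math.ceil(0.975 * resamples) - 1];
  return [lower, upper];
}

// Made-up scores: three of the four pairs are ranked correctly.
console.log(auroc([0.9, 0.3], [0.5, 0.1])); // 0.75
```

The double loop is quadratic in the number of predictions, which is fine at a few hundred rows; a rank-based computation would be the usual choice for larger sets.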
Disease | N | + rate | AUROC | AUPRC | P@10
(Rows load dynamically on the live page.)
Base rate caveat: Disease areas with positive rates above 90% (e.g., glioblastoma 99.6%) have limited AUROC discriminability: the estimate rests on very few negatives, and a trivial classifier that flags every candidate already scores AUPRC and P@10 near the positive rate. The March 2026 retrospective snapshot is marked stale in data/eval/auroc-retrospective.json due to an April 2026 bulk backfill; recomputing would not produce a comparable metric, so the March 2026 snapshot remains the citable baseline.
6
What this means for researchers
Reading the Brier score (live value above) — and its ECE caveat:

· A Brier below the 0.25 no-skill baseline beats random, but when ECE is high the low Brier is driven by class imbalance in the resolved set, not by calibrated confidence. Always read the two together.
· Predictions with 70%+ confidence match completed trials at meaningful rates — actionable for hypothesis prioritization.
· Mid-range predictions (40–70%) are underconfident — actual trial completion rates exceed what the model predicts. Treat as broader candidate lists, not tightly ranked signals.
· Low-confidence scores (below 40%) are exploratory only — biological plausibility is flagged, not clinical precedent.
  • This is not clinical validation. A completed CT.gov trial means investigators chose to study this drug–disease pair. It does not imply the drug works, is approved, or is safe for any specific indication. Completed ≠ proven efficacy.

  • High-confidence GaiaLab scores are computational triage signals. Evidence-grounded starting points for literature review, experimental design, and grant framing — not endpoints. The appropriate follow-up is systematic literature review, not a clinical decision.

  • Retrospective concordance is the honest baseline. A system that independently ranks drugs already under clinical investigation is demonstrating biological plausibility, not predicting unknown outcomes. The Brier score and retrospective AUROC both use this standard.

  • Prospective concordance is what matters scientifically. GaiaLab timestamps every prediction. When a new trial registers after a prediction, that is a prospective match — the only fair test of forward predictive ability. See /new-trials for the live classified ledger and the pipeline of 330+ unmatched predictions entering their 30–90 day window.

7
Raw data access & reproducibility

All calibration data is publicly accessible. These endpoints return the exact numbers shown on this page. No authentication required.

GET | /api/predictions/calibration | Full JSON: Brier, ECE, reliability curve by decile
GET | /api/new-trials              | All concordance matches with timestamps and outcomes
GET | /api/stats/public            | Summary statistics (counts, recruiting, Brier headline)
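
A minimal sketch of pulling the calibration JSON from a script. The base URL is an assumption inferred from the citation line on this page, the function name is hypothetical, and no response field names are assumed because the JSON schema is not documented here. The `fetchImpl` parameter exists only so the function can be exercised without network access.

```javascript
// Assumption: base URL inferred from the citation line; adjust if the API lives elsewhere.
const BASE_URL = "https://gailabai.com";

// Fetches the calibration payload and returns the parsed JSON.
// fetchImpl defaults to the global fetch (available in browsers and Node 18+).
async function fetchCalibration(fetchImpl = fetch, baseUrl = BASE_URL) {
  const res = await fetchImpl(`${baseUrl}/api/predictions/calibration`);
  if (!res.ok) {
    throw new Error(`Calibration request failed: HTTP ${res.status}`);
  }
  return res.json();
}

// Usage (performs a real network request, so it is left commented out):
// fetchCalibration().then((data) => console.log(Object.keys(data)));
```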

Benchmark scripts (reproducible)

SCRIPT | scripts/benchmark-auroc.js         | AUROC/AUPRC: internal CT.gov or OpenTargets gold standard
SCRIPT | scripts/gaialab-eval.js            | Full evaluation: NDCG@10, paired t-test, grounding rate
FILE   | data/eval/auroc-retrospective.json | March 2026 snapshot (citable baseline)
FILE   | data/eval/auroc-prospective.json   | Prospective AUROC (approval-year temporal holdout)
Peer review status: Not yet externally peer-reviewed. Methodology is preprint-ready. To reproduce these numbers or review this work, contact partnerships@gailabai.com.

Citation format (draft): Idiakhoa O. (2026). GaiaLab: evidence-grounded computational drug triage with published calibration. Brier Score and ECE reported live (see current values above), AUROC 0.545 (22 disease areas). gailabai.com/calibration. (Cite the Brier/ECE/N shown live at access time — the resolved set grows weekly.)