Every number on this page is verifiable. The raw data, benchmark scripts, and calibration API are publicly accessible. Raw data: /api/predictions/calibration
Three calibration metrics, each with a distinct gold standard and a distinct interpretation. They are not interchangeable — read the caveats before citing.
Computed in src/utils/prediction-tracker.js → getCalibration(). The formula and inclusion criteria are fixed — not tuned post-hoc.
Resolved outcome required. Included only when outcomeVerdict = 'validated' (completed CT.gov trial found) or outcomeVerdict = 'insufficient_data' (checked, no trial found). Predictions with trial_active status are excluded — outcome unknown.
Predicted probability = normalized confidence score. Each analysis assigns a repurposing score (0–100) via calculateRepurposingScore() in src/ai/models/drug-repurposing-engine.js; the score is then normalized and clamped to [0, 1]. Scores reflect six weighted factors: target overlap, clinical evidence, MOA alignment, disease context, pathway enrichment, and safety profile.
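The normalization step can be sketched as follows. This is a minimal illustration, assuming the 0–100 score is divided by 100 before clamping; scoreToProbability is a hypothetical helper name, not a function confirmed to exist in the codebase.

```javascript
// Hypothetical helper (not from the GaiaLab source): map a 0-100
// repurposing score to a probability, assuming a divide-by-100 rescale.
function scoreToProbability(score) {
  if (!Number.isFinite(score)) {
    throw new TypeError('score must be a finite number');
  }
  // Clamp so out-of-range scores cannot produce invalid probabilities.
  return Math.min(1, Math.max(0, score / 100));
}

console.log(scoreToProbability(73));  // 0.73
console.log(scoreToProbability(120)); // 1
console.log(scoreToProbability(-5));  // 0
```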
Soft outcome labels. Validated predictions receive a graded label based on the trial's extracted result class (see table below), reducing sensitivity to false positives from trials with mixed or neutral results.
Date range: predictions span from GaiaLab's earliest recorded analysis (early 2025) through the current date. N grows weekly as npm run predictions:check runs ClinicalTrials.gov lookups. The live count is always available at /api/predictions/calibration.
| outcomeVerdict | completedOutcomeClass | Label yᵢ | Meaning |
|---|---|---|---|
| validated | positive | 1.0 | Trial completed; positive result for this drug–disease direction |
| validated | mixed | 0.5 | Trial completed; mixed or inconclusive results |
| validated | neutral | 0.25 | Trial completed; no detectable signal in either direction |
| validated | negative | 0.0 | Trial completed; negative result |
| insufficient_data | n/a | 0.0 | Checked; no completed trial found (treated as negative) |
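The label table above, combined with the Brier score (mean squared error between predicted probability and label), can be sketched as below. This is an illustration, not the actual getCalibration() code: the function names and the probability field are assumptions, as is the choice to exclude a validated prediction whose result class is not in the table.

```javascript
// Soft labels copied from the table above.
const SOFT_LABELS = new Map([
  ['positive', 1.0],
  ['mixed', 0.5],
  ['neutral', 0.25],
  ['negative', 0.0],
]);

// Returns the graded label, or null when the prediction is excluded.
function softLabel(prediction) {
  if (prediction.outcomeVerdict === 'insufficient_data') return 0.0;
  if (prediction.outcomeVerdict === 'validated') {
    const label = SOFT_LABELS.get(prediction.completedOutcomeClass);
    // Assumption: an unrecognized result class is excluded, not guessed.
    return label === undefined ? null : label;
  }
  return null; // e.g. trial_active: outcome unknown
}

// Brier score over resolved predictions only; `probability` is an
// assumed field holding the normalized confidence in [0, 1].
function brierScore(predictions) {
  let sum = 0;
  let n = 0;
  for (const p of predictions) {
    const y = softLabel(p);
    if (y === null) continue;
    sum += (p.probability - y) ** 2;
    n += 1;
  }
  return n === 0 ? null : sum / n;
}
```

Lower is better: 0 means every prediction matched its label exactly, and a constant prediction of 0.5 against binary labels scores 0.25.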
Generated live from /api/predictions/calibration. Shows the observed trial completion rate for predictions grouped by confidence decile. A perfectly calibrated model tracks the dashed diagonal — every bar at the same height as the diagonal line.
ECE is the weighted average absolute deviation of predicted confidence from observed rates across decile bins. Where Brier is mean squared error, ECE is mean absolute error in probability space — more interpretable as "how far off, on average."
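A minimal sketch of that ECE calculation follows. It assumes equal-width bins of 0.1 (the page says "decile" without specifying equal-width versus quantile binning) and an input of { p, y } pairs, where p is the predicted probability and y the outcome label; these names are illustrative, not taken from the source.

```javascript
// Expected calibration error over equal-width bins.
// pairs: [{ p: predicted probability in [0, 1], y: label in [0, 1] }]
function expectedCalibrationError(pairs, numBins = 10) {
  if (pairs.length === 0) return null;
  const bins = Array.from({ length: numBins }, () => ({ n: 0, sumP: 0, sumY: 0 }));
  for (const { p, y } of pairs) {
    // p === 1 falls into the last bin rather than a bin past the end.
    const idx = Math.min(numBins - 1, Math.max(0, Math.floor(p * numBins)));
    bins[idx].n += 1;
    bins[idx].sumP += p;
    bins[idx].sumY += y;
  }
  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue; // empty bins carry no weight
    const meanConfidence = b.sumP / b.n;
    const observedRate = b.sumY / b.n;
    ece += (b.n / pairs.length) * Math.abs(meanConfidence - observedRate);
  }
  return ece;
}

// Two bins, each holding half the data, with gaps of 0.4 and 0.1:
// ECE is approximately 0.5 * 0.4 + 0.5 * 0.1 = 0.25.
console.log(expectedCalibrationError([
  { p: 0.9, y: 1 }, { p: 0.9, y: 0 },
  { p: 0.1, y: 0 }, { p: 0.1, y: 0 },
]));
```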
AUROC computed via Mann-Whitney U statistic in scripts/benchmark-auroc.js. Answers: "given a random positive–negative pair, how often does the model rank the positive higher?" 0.50 = random; 1.0 = perfect discrimination.
Positive = outcomeVerdict = validated (completed CT.gov trial found). Negative = outcomeVerdict = insufficient_data (checked, no trial found). Active and unchecked predictions are excluded. Bootstrap CI over 1,000 resamples.
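The AUROC and its bootstrap interval can be sketched as follows. This is not the contents of scripts/benchmark-auroc.js: the input shape, the midrank tie handling, the percentile method, and the 95% level are all assumptions made for illustration.

```javascript
// AUROC from the Mann-Whitney U statistic: U / (nPos * nNeg).
// items: [{ score: number, positive: boolean }]; ties receive midranks.
function auroc(items) {
  const sorted = [...items].sort((a, b) => a.score - b.score);
  const nPos = sorted.filter((d) => d.positive).length;
  const nNeg = sorted.length - nPos;
  if (nPos === 0 || nNeg === 0) return null; // undefined with one class
  let rankSumPos = 0;
  let i = 0;
  while (i < sorted.length) {
    let j = i;
    while (j + 1 < sorted.length && sorted[j + 1].score === sorted[i].score) j += 1;
    const midrank = (i + j + 2) / 2; // mean of 1-based ranks i+1 .. j+1
    for (let k = i; k <= j; k += 1) {
      if (sorted[k].positive) rankSumPos += midrank;
    }
    i = j + 1;
  }
  const u = rankSumPos - (nPos * (nPos + 1)) / 2;
  return u / (nPos * nNeg);
}

// Percentile bootstrap CI: resample with replacement, recompute AUROC.
function bootstrapAurocCI(items, { resamples = 1000, alpha = 0.05, random = Math.random } = {}) {
  const stats = [];
  for (let b = 0; b < resamples; b += 1) {
    const sample = Array.from(
      { length: items.length },
      () => items[Math.floor(random() * items.length)],
    );
    const value = auroc(sample);
    if (value !== null) stats.push(value); // skip single-class resamples
  }
  if (stats.length === 0) return null;
  stats.sort((a, b) => a - b);
  const at = (q) => stats[Math.min(stats.length - 1, Math.floor(q * stats.length))];
  return { lower: at(alpha / 2), upper: at(1 - alpha / 2) };
}

// Three of the four positive-negative pairs are ranked correctly.
console.log(auroc([
  { score: 0.9, positive: true }, { score: 0.6, positive: true },
  { score: 0.7, positive: false }, { score: 0.2, positive: false },
])); // 0.75
```

Passing a seeded generator as random makes the interval reproducible between runs.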
| Disease | N | + rate | AUROC | AUPRC | P@10 |
|---|---|---|---|---|---|
stale in data/eval/auroc-retrospective.json due to an April 2026 bulk backfill; recomputing would not produce a comparable metric. The March 2026 snapshot is the citable baseline.
This is not clinical validation. A completed CT.gov trial means investigators chose to study this drug–disease pair. It does not imply the drug works, is approved, or is safe for any specific indication. Completed ≠ proven efficacy.
High-confidence GaiaLab scores are computational triage signals. Evidence-grounded starting points for literature review, experimental design, and grant framing — not endpoints. The appropriate follow-up is systematic literature review, not a clinical decision.
Retrospective concordance is the honest baseline. A system that independently ranks drugs already under clinical investigation is demonstrating biological plausibility, not predicting unknown outcomes. The Brier score and retrospective AUROC both use this standard.
Prospective concordance is what matters scientifically. GaiaLab timestamps every prediction. When a new trial registers after a prediction, that is a prospective match — the only fair test of forward predictive ability. See /new-trials for the live classified ledger and the pipeline of 330+ unmatched predictions entering their 30–90 day window.
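The timestamp comparison behind a prospective match can be sketched as below. The function and parameter names are hypothetical, and the sketch covers only the "trial registered after the prediction" rule, not the classification shown on /new-trials.

```javascript
// Hypothetical check: a match is prospective only when the trial was
// registered strictly after the prediction was timestamped.
function isProspectiveMatch(predictedAt, trialRegisteredAt) {
  const predicted = new Date(predictedAt).getTime();
  const registered = new Date(trialRegisteredAt).getTime();
  if (Number.isNaN(predicted) || Number.isNaN(registered)) {
    throw new RangeError('both arguments must be valid dates');
  }
  return registered > predicted;
}

console.log(isProspectiveMatch('2025-03-01T00:00:00Z', '2025-04-15T00:00:00Z')); // true
console.log(isProspectiveMatch('2025-04-15T00:00:00Z', '2025-03-01T00:00:00Z')); // false
```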
All calibration data is publicly accessible. These endpoints return the exact numbers shown on this page. No authentication required.
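A minimal sketch of reading the calibration endpoint follows. The path is the one given on this page; the base URL is a placeholder supplied by the caller, and the assumption that the endpoint returns JSON is not confirmed here, so no response fields are named.

```javascript
const CALIBRATION_PATH = '/api/predictions/calibration';

// Build the full URL from a caller-supplied base (placeholder host below).
function calibrationUrl(baseUrl) {
  return new URL(CALIBRATION_PATH, baseUrl).toString();
}

// Fetch the raw calibration payload. fetchImpl defaults to the global
// fetch (Node 18+ or browsers) and can be swapped for a stub in tests.
async function fetchCalibration(baseUrl, fetchImpl = fetch) {
  const response = await fetchImpl(calibrationUrl(baseUrl), {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Calibration request failed: HTTP ${response.status}`);
  }
  return response.json(); // assumption: the body is JSON
}

console.log(calibrationUrl('https://example.org'));
// https://example.org/api/predictions/calibration
```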