Methodology

Step 1

Analysis Pipeline

Every GaiaLab analysis runs through five sequential stages. Within stages 1 and 2, requests run fully in parallel across all sources.

Gene normalisation

Input gene symbols are normalised to HGNC approved symbols. Aliases (e.g. HER2 → ERBB2) are resolved before any database query. Invalid symbols are flagged and excluded from scoring but included in the report.
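A minimal sketch of the alias-resolution step described above. The alias table, the approved-symbol set, and the function name `normaliseGenes` are illustrative assumptions; a real deployment would load both tables from HGNC rather than hard-code them.

```typescript
// Hypothetical lookup tables, hard-coded here only for illustration.
const ALIASES: Record<string, string> = { HER2: "ERBB2", P53: "TP53" };
const APPROVED = new Set<string>(["ERBB2", "TP53", "BRCA1"]);

interface NormalisedPanel {
  approved: string[]; // approved symbols passed on to scoring
  invalid: string[];  // flagged, excluded from scoring, still listed in the report
}

function normaliseGenes(input: string[]): NormalisedPanel {
  const approved: string[] = [];
  const invalid: string[] = [];
  for (const raw of input) {
    const symbol = raw.trim();
    // Resolve a known alias first, otherwise keep the symbol as given.
    const resolved = ALIASES[symbol] ?? symbol;
    if (APPROVED.has(resolved)) {
      if (!approved.includes(resolved)) approved.push(resolved); // drop duplicates
    } else {
      invalid.push(raw);
    }
  }
  return { approved, invalid };
}
```

Under these assumptions, `normaliseGenes(["HER2", "BRCA1", "NOTAGENE"])` keeps ERBB2 and BRCA1 for scoring and reports NOTAGENE as invalid.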

Parallel data fetch — 75+ databases¹

All database queries run simultaneously via Promise.allSettled(). No source blocks another. A timeout or API error in one source does not prevent results from the remaining sources. Each client returns partial results on failure rather than throwing.
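The fan-out pattern can be sketched as follows. `Promise.allSettled()` is the standard JavaScript API named above; the `SourceClient` type, the `withTimeout` helper, and the 5-second default are assumptions made for this example, not GaiaLab's actual client interface.

```typescript
type SourceClient = () => Promise<unknown>;

interface SourceResult {
  source: string;
  ok: boolean;
  data: unknown;   // null when the source failed
  error?: string;
}

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function fetchAllSources(
  clients: Record<string, SourceClient>,
  timeoutMs = 5000, // assumed default, not taken from the source
): Promise<SourceResult[]> {
  const names = Object.keys(clients);
  // allSettled never rejects, so one failing source cannot block the rest.
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(clients[name](), timeoutMs)),
  );
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { source: names[i], ok: true, data: s.value }
      : { source: names[i], ok: false, data: null, error: String(s.reason) },
  );
}
```

Each entry in the returned array records whether its source succeeded, so downstream stages can disclose missing sources instead of silently dropping them.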

Channel aggregation

Raw API responses are aggregated into 16 evidence channels by domain-specific aggregators. Each aggregator applies source-specific normalisation, deduplication, and confidence flags before passing data downstream.

Scoring and classification

Drug candidates are scored 0–100 across six weighted factors. Pathways are ranked by FDR-corrected enrichment p-value. Hypotheses are filtered by evidence quality and cross-deduplicated against input gene tokens.

6-role AI debate

Six structured AI roles — Hypothesis, Critic, Evidence, Risk, Innovation, Synthesis — debate the scored data. Generative roles run on DeepSeek V3; critical/evaluative roles run on Anthropic Claude. Each role receives grounded prompts seeded with scored outputs from stage 4, not raw database dumps. The Synthesis role produces the final executive brief.

Data

Data Sources

75+ databases queried per analysis across seven domains.¹ All clients use Promise.allSettled() — a failure in any source does not block results from others.

Gene annotation & variation

Source	Data type	Auth
HGNC	Approved symbol, aliases, gene family	No
NCBI Gene	Entrez ID, summary, RefSeq	Optional (rate limit)
Ensembl	Stable ID, biotype, cross-references	No
UniProt	Protein function, variants, subcellular location, PTMs	No
ClinVar	Pathogenic/benign variant classifications	No
ClinGen	Gene-disease validity, haploinsufficiency	No
gnomAD (variant)	Population allele frequencies, constraint metrics	No
gnomAD (constraint)	pLI, LOEUF, missense Z-score	No
gnomAD (ancestry)	Ancestry-stratified allele counts	No
GWAS Catalog	Trait associations, lead SNPs, p-values	No
OMIM	Mendelian disease associations	No
Monarch Initiative	Cross-species phenotype associations	No
VEP (Ensembl)	Variant effect predictions	No
AGR (Alliance)	Cross-model-organism gene data	No

Pathway & functional annotation

Source	Data type	Auth
KEGG	Pathway membership, module associations	No
Reactome	Hierarchical pathway enrichment	No
Gene Ontology	BP, MF, CC terms	No
Enrichr	Gene set enrichment across 200+ libraries	No
MSigDB	Hallmark, C2, C6 gene sets	No
PathwayCommons	Merged pathway graph from multiple curated pathway databases	No
JASPAR	Transcription factor binding motifs	No
ChEA3	Transcription factor enrichment	No

Interaction & network

Source	Data type	Auth
STRING	Functional association network scores	No
STRING-DB partners	Physical interaction partners	No
BioGRID	PPIs, genetic interactions	Optional
IntAct	Curated molecular interactions, MI scores	No
ComplexPortal	Macromolecular complex membership	No
SynLethDB	Synthetic lethality pairs	No

Literature

Source	Data type	Auth
PubMed / NCBI Entrez	Citation metadata, MeSH terms, abstracts	Optional (3→10 req/s)
PMC Full-Text	JATS XML → quantitative extraction (IC50, HR, OR, n=, fold-change)	No
Europe PMC	Open-access full text, preprints	No
OpenAlex	Works, citations, author disambiguation	No
Semantic Scholar	Citation graph, influential papers	Optional
bioRxiv	Preprint titles and abstracts	No
Preprint monitor	New preprints matching panel genes (internal)	—

Drug, clinical & regulatory

Source	Data type	Auth
ChEMBL	IC50, EC50, Ki, pChEMBL values, mechanism of action	No
ClinicalTrials.gov v2	Active trials, phase, intervention, NCT IDs	No
OpenFDA	Adverse event counts, drug approval status	No
OncoKB	Oncology actionability tiers, variant-drug mappings	No
CIViC	Clinical interpretations of variants	No
DGIdb	Drug-gene interaction types and sources	No
DrugCentral	Drug targets, MOA, FDA labels	No
TTD	Therapeutic target database	No
PharmGKB	Pharmacogenomics annotations	No
PubChem (compound)	Structure, SMILES, InChI	No
PubChem (bioassay)	Bioactivity assay results	No
RxNorm DDI	Drug-drug interaction severity	No
OpenTargets	Disease-gene association scores (genetic, somatic, literature)	No
OpenTargets Genetics	QTL, GWAS colocalization, fine-mapping	No
Drug resistance intelligence	Known resistance mechanisms per drug class	—
FDA Regulatory	Label text, boxed warnings, indication	No
Patent status	Patent expiry year, exclusivity status	No
LINCS	Perturbation gene expression signatures (L1000)	No
DisGeNET	Gene-disease associations with evidence score	API key
DrugBank	Drug targets, pharmacokinetics, interactions	API key

Omics & cancer

Source	Data type	Auth
TCGA	Somatic mutation frequency, expression	No
TCGA survival	Survival stratification by mutation/expression	No
cBioPortal	Alteration frequency, mutation-aware survival stratification across TCGA cohorts	No
COSMIC Signatures	Mutational signature contributions	No
DepMap	Cancer dependency scores (CRISPR screen)	No
DepMap co-essentiality	Co-essential gene pairs across cell lines	No
GDSC	Drug sensitivity (IC50) across cancer cell lines	No
GTEx (expression)	Tissue-specific RNA expression	No
GTEx (eQTL)	Expression quantitative trait loci	No
HPA	Protein and RNA atlas, subcellular localisation	No
CPTAC	Proteogenomic abundance, phospho-state	No
ProteomicsDB	Human proteome expression	No
PRIDE	Mass spectrometry proteomics datasets	No
CELLxGENE	Single-cell RNA-seq cell-state annotations	No
HMDB	Metabolite-gene associations	No
MetaboLights	Metabolomics studies	No
Orphanet	Rare disease gene associations	No

Structural

Source	Data type	Auth
AlphaFold (EBI)	pLDDT per-residue confidence → druggability score	No
PDB	Experimental 3D structures, resolution	No

Sources marked "API key" fall back gracefully when credentials are absent: missing sources are disclosed in the analysis output, not filled with inferred data. Without DisGeNET and DrugBank keys, active coverage is approximately 60 of the 75+ sources.

Pathway Analysis

FDR-Corrected Pathway Enrichment

GaiaLab uses a hypergeometric test for gene set enrichment, then applies Benjamini-Hochberg (BH) multiple testing correction across all tested pathways.

Hypergeometric test

P(X ≥ k) = Σ C(K,i) · C(N-K, n-i) / C(N,n), summed over i = k to min(n,K)
Where:
N = genome background size (21,000 protein-coding genes)
K = genes in pathway (from database annotation)
n = genes in input panel
k = overlap between input panel and pathway
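The tail sum above can be evaluated in log space to avoid overflow in the binomial coefficients. This is an illustrative sketch: the function names are hypothetical, and the log-factorial table is rebuilt on every call for simplicity, whereas a production implementation would likely cache it.

```typescript
// Table of ln(i!) for i = 0..max, so C(a, b) can be formed as a difference of logs.
function logFactorials(max: number): Float64Array {
  const t = new Float64Array(max + 1);
  for (let i = 1; i <= max; i++) t[i] = t[i - 1] + Math.log(i);
  return t;
}

// P(X >= k) for N background genes, K pathway genes, n panel genes.
function hypergeomUpperTail(N: number, K: number, n: number, k: number): number {
  const lf = logFactorials(N);
  const logChoose = (a: number, b: number) => lf[a] - lf[b] - lf[a - b];
  const logDenom = logChoose(N, n);
  // Terms below this index have zero probability (cannot draw more non-pathway genes than exist).
  const lo = Math.max(k, 0, n - (N - K));
  const hi = Math.min(n, K);
  let p = 0;
  for (let i = lo; i <= hi; i++) {
    p += Math.exp(logChoose(K, i) + logChoose(N - K, n - i) - logDenom);
  }
  return Math.min(1, p); // guard against rounding slightly above 1
}
```

For example, `hypergeomUpperTail(21000, 100, 10, 3)` gives the raw p-value for a 10-gene panel sharing 3 genes with a 100-gene pathway against the 21,000-gene background stated above.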

BH correction

Raw p-values across all pathways are ranked ascending. Each pathway receives an adjusted q-value:

q_i = p_i · (m / i)
Where:
m = total number of pathways tested
i = rank of this pathway (1 = smallest p-value)
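A sketch of the adjustment in code. The textbook Benjamini-Hochberg procedure adds two steps to the raw formula: q-values are capped at 1, and each q is replaced by the minimum of itself and all higher-ranked raw values so that q never decreases with rank. This document does not say whether GaiaLab applies those steps; they are included here as the standard form, and the function name is hypothetical.

```typescript
// Returns BH-adjusted q-values in the same order as the input p-values.
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  const order = pValues
    .map((p, idx) => ({ p, idx }))
    .sort((a, b) => a.p - b.p); // ascending, so order[r] has rank r + 1
  const q = new Array<number>(m);
  let running = 1; // also caps q at 1
  // Walk from the largest p-value down, carrying the running minimum.
  for (let r = m - 1; r >= 0; r--) {
    const raw = order[r].p * (m / (r + 1));
    running = Math.min(running, raw);
    q[order[r].idx] = running;
  }
  return q;
}
```

With p-values `[0.01, 0.011, 0.5]` the raw values are 0.03, 0.0165 and 0.5; the monotonicity step lowers the first to 0.0165.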

Pathways are labelled by significance tier:

high — q ≤ 0.01
moderate — q ≤ 0.05
nominal — q ≤ 0.10
ns — q > 0.10 (not shown by default)

Only pathways at q ≤ 0.05 are included in the executive brief and drug scoring. Pathways with 0.05 < q ≤ 0.10 are shown in the full pathway panel with a "nominal" label. The q ≤ 0.10 display cutoff (tightened from q < 0.20) limits the expected false-discovery rate among displayed pathways to 1 in 10 rather than 1 in 5; the q ≤ 0.05 cutoff used for the brief and drug scoring limits it to 1 in 20.
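The tier boundaries and the two inclusion rules translate directly into code. The function and constant names below are illustrative, not taken from the GaiaLab codebase.

```typescript
type SignificanceTier = "high" | "moderate" | "nominal" | "ns";

// Map a BH-adjusted q-value to the display tier listed above.
function significanceTier(q: number): SignificanceTier {
  if (q <= 0.01) return "high";
  if (q <= 0.05) return "moderate";
  if (q <= 0.10) return "nominal";
  return "ns";
}

// Inclusion rules: the brief and drug scoring use the stricter cutoff.
const inExecutiveBrief = (q: number): boolean => q <= 0.05;
const inPathwayPanel = (q: number): boolean => q <= 0.10;
```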

Evidence Hygiene

Citation Verification & Hallucination Detection

GaiaLab runs a three-stage evidence integrity pipeline on every analysis to ensure cited literature is real, relevant, and accurately represented.

Stage 0 — PMID existence check

After every analysis completes, all PMIDs produced by the AI synthesis layer are batch-queried against the NCBI PubMed E-utilities esummary API in groups of 10. This check runs asynchronously — it does not add latency to analysis delivery. Any PMID not returned by PubMed's index is flagged as hallucinated and logged to data/quality/invalid-pmids.json. The hallucination rate is patched into the snapshot and reported on the Trust dashboard for runs from May 2026 forward. Historical runs prior to this date were not validated retroactively and show "Not checked" on the Trust dashboard.
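A sketch of the batching logic, with the network call abstracted away. The `lookup` parameter is a hypothetical stand-in for the esummary request and is assumed to return the subset of a batch that PubMed recognises. The rate definition (invalid PMIDs over unique PMIDs) is also an assumption, since the text does not define the denominator.

```typescript
// Hypothetical helper signature: given a batch of PMIDs, resolve to those PubMed knows.
type PmidLookup = (batch: string[]) => Promise<string[]>;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function findInvalidPmids(
  pmids: string[],
  lookup: PmidLookup,
  batchSize = 10, // groups of 10, as described above
): Promise<{ invalid: string[]; hallucinationRate: number }> {
  const unique = Array.from(new Set(pmids));
  const found = new Set<string>();
  for (const batch of chunk(unique, batchSize)) {
    for (const id of await lookup(batch)) found.add(id);
  }
  // Anything PubMed did not return is treated as non-existent.
  const invalid = unique.filter((id) => !found.has(id));
  const hallucinationRate = unique.length === 0 ? 0 : invalid.length / unique.length;
  return { invalid, hallucinationRate };
}
```

Because the check runs after delivery, a caller would typically start `findInvalidPmids(...)` without awaiting it in the request path and patch the result into the snapshot when it resolves.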

Stage 1 — NLI entailment check

Claims from the analysis are verified against their cited abstract text using DeBERTa-v3-large (cross-encoder/nli-deberta-v3-large), a state-of-the-art Natural Language Inference model. The entailment score threshold is 0.5 — claims that score below this are flagged as weakly supported. Context window: 2,500 characters per passage, 400 characters per claim.

Stage 2 — ALCE-style cite metrics

Inspired by the ALCE attribution benchmark, GaiaLab computes cite-precision, cite-recall, and cite-F1 for each analysis:

cite-precision = claims with ≥1 supporting citation / total claims
cite-recall = citations actually used / total citations provided
cite-F1 = 2 × (precision × recall) / (precision + recall)

These metrics are shown on the Trust page. A cite-F1 ≥ 0.6 is considered well-grounded.
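The three metrics, as defined above, can be computed in a few lines. The `Claim` shape and the handling of empty inputs (returning 0) are assumptions made for this sketch.

```typescript
interface Claim {
  text: string;
  supportingPmids: string[]; // citations judged to support this claim
}

function citeMetrics(claims: Claim[], providedPmids: string[]) {
  const provided = new Set(providedPmids);
  // Claims with at least one supporting citation.
  const supported = claims.filter((c) => c.supportingPmids.length > 0).length;
  // Provided citations that some claim actually uses.
  const used = new Set(
    claims.flatMap((c) => c.supportingPmids).filter((p) => provided.has(p)),
  );
  const precision = claims.length === 0 ? 0 : supported / claims.length;
  const recall = provided.size === 0 ? 0 : used.size / provided.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1, wellGrounded: f1 >= 0.6 };
}
```

For instance, four claims that are all supported but draw on only two of four provided citations give precision 1.0, recall 0.5 and cite-F1 of about 0.67, which clears the 0.6 bar.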

Multi-agent citation floor

Any insight produced by the 6-agent debate that has zero verified PMIDs is annotated with citationFloor: false and its evidence quality is capped at "moderate". A "⚠ No PMIDs" badge is shown on the insight card in the analysis output.
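A sketch of how the floor might be applied to a single insight. The three-level quality scale, the `badge` field, and setting `citationFloor: true` for insights that do have verified PMIDs are assumptions; the text only specifies the `citationFloor: false` case.

```typescript
type EvidenceQuality = "high" | "moderate" | "low"; // assumed scale

interface DebateInsight {
  text: string;
  verifiedPmids: string[];
  evidenceQuality: EvidenceQuality;
}

interface FlooredInsight extends DebateInsight {
  citationFloor: boolean;
  badge?: string;
}

function applyCitationFloor(insight: DebateInsight): FlooredInsight {
  if (insight.verifiedPmids.length > 0) {
    return { ...insight, citationFloor: true };
  }
  // Zero verified PMIDs: cap quality at "moderate" and attach the warning badge.
  const capped: EvidenceQuality =
    insight.evidenceQuality === "high" ? "moderate" : insight.evidenceQuality;
  return { ...insight, evidenceQuality: capped, citationFloor: false, badge: "⚠ No PMIDs" };
}
```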

Drug Scoring

Relation-Aware Drug Scoring

Each drug candidate is scored 0–100 across six weighted factors, then classified into a tier and assigned a floor/cap based on regulatory status.

Scoring formula

finalScore = (
  targetScore × 0.30 +    // gene-drug target overlap in input panel (ChEMBL, DGIdb, OpenTargets)
  clinicalScore × 0.25 +  // phase, trial status, FDA approval
  moaScore × 0.20 +       // mechanism of action alignment to disease pathway
  contextScore × 0.12 +   // disease co-mention in literature + OpenTargets association score
  pathwayScore × 0.08 +   // enriched pathway membership overlap with drug targets
  safetyScore × 0.05      // OpenFDA adverse event burden; Lipinski / ADMET flags
) +

Bonus signals (added after weighted sum):
  AlphaFold structural bonus: +0 to +10 (pLDDT ≥80 → +10, ≥70 → +6, ≥60 → +3)
  DepMap essentiality bonus: +0 to +8 (cancer dependency score in disease cell lines)
  Network proximity bonus: +0 to +5 (≤2 hops from panel genes in STRING/BioGRID)

Penalties (single worst-case applies, not cascading):
  contextRelevance < 20 → score × 0.30
  contextRelevance 20–34 → score × 0.45 (off-label, not on-panel only; on-label drugs are exempt)

FDA-approved floor:
  On-label FDA-approved → floor = 70 (Tier I guaranteed)
  Off-label FDA-approved → floor = 35 (if contextRelevance > 0)
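One way to read the formula as code. Several details are not pinned down by the text and are assumptions here: factor scores are on a 0 to 100 scale, bonuses are added before the penalty multiplier, on-label drugs are exempt from both penalty bands, the 0.45 band applies only to off-panel drugs, and the result is clamped to 0 to 100. DepMap and network bonuses are taken as precomputed inputs.

```typescript
interface Factors {
  target: number; clinical: number; moa: number;
  context: number; pathway: number; safety: number; // each assumed 0..100
}

interface Signals {
  plddt?: number;        // AlphaFold confidence for the drug target
  depMapBonus?: number;  // 0..8, derived upstream
  networkBonus?: number; // 0..5, derived upstream
  contextRelevance: number;
  fdaApproved: boolean;
  onLabel: boolean;
  onPanel: boolean;
}

const WEIGHTS: Factors = {
  target: 0.30, clinical: 0.25, moa: 0.20, context: 0.12, pathway: 0.08, safety: 0.05,
};

function alphaFoldBonus(plddt = 0): number {
  if (plddt >= 80) return 10;
  if (plddt >= 70) return 6;
  if (plddt >= 60) return 3;
  return 0;
}

function finalScore(f: Factors, s: Signals): number {
  const weighted = (Object.keys(WEIGHTS) as (keyof Factors)[])
    .reduce((sum, key) => sum + f[key] * WEIGHTS[key], 0);
  let score = weighted
    + alphaFoldBonus(s.plddt)
    + Math.min(8, s.depMapBonus ?? 0)
    + Math.min(5, s.networkBonus ?? 0);
  // At most one penalty applies (the branches are mutually exclusive).
  if (!s.onLabel) {
    if (s.contextRelevance < 20) score *= 0.30;
    else if (s.contextRelevance < 35 && !s.onPanel) score *= 0.45;
  }
  // Regulatory floors.
  if (s.fdaApproved && s.onLabel) score = Math.max(score, 70);
  else if (s.fdaApproved && s.contextRelevance > 0) score = Math.max(score, 35);
  return Math.min(100, Math.max(0, score)); // assumed clamp to the 0..100 range
}
```

The six weights sum to 1.00, so the weighted part alone stays within 0 to 100; the clamp only matters once bonuses are added.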

Tier classification

Tier I

Score ≥ 70. Strong evidence. On-panel target, clinical data, context match. Shown prominently in all views.

Tier II

Score 50–69. Moderate evidence. Includes all FDA-approved drugs that pass context filter. Up to 3 shown by default.

Tier III

Score < 50. Exploratory. Collapsed behind toggle. Requires explicit expansion by the user.
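The tier cutoffs and default visibility rules can be sketched as below. Choosing the three highest-scoring Tier II drugs for the default view is an assumption; the text says only that up to 3 are shown.

```typescript
type Tier = "I" | "II" | "III";

interface ScoredDrug { name: string; score: number; }

function drugTier(score: number): Tier {
  if (score >= 70) return "I";
  if (score >= 50) return "II";
  return "III";
}

// Drugs visible without user interaction: all of Tier I plus up to three from Tier II.
function defaultVisible(drugs: ScoredDrug[]): ScoredDrug[] {
  const sorted = [...drugs].sort((a, b) => b.score - a.score);
  const tierI = sorted.filter((d) => drugTier(d.score) === "I");
  const tierII = sorted.filter((d) => drugTier(d.score) === "II").slice(0, 3);
  return [...tierI, ...tierII]; // Tier III stays collapsed behind the toggle
}
```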

Filters applied before scoring

Context relevance ≥ 40 required for off-panel drugs (≥ 30 for on-panel)
Clinical evidence score ≥ 15 required for off-panel drugs
Synthetic lethality only computed in oncology disease contexts
Duplicate canonical drugs resolved by highest repurposingScore
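A sketch of the threshold filters and the duplicate-resolution rule. The `Candidate` shape is hypothetical, and running the threshold filter before deduplication is an assumption about ordering. The synthetic-lethality rule is omitted because it gates a computation rather than filtering candidates.

```typescript
interface Candidate {
  canonicalName: string;
  onPanel: boolean;          // drug targets a gene in the input panel
  contextRelevance: number;
  clinicalEvidence: number;
  repurposingScore: number;
}

function preScoringFilter(cands: Candidate[]): Candidate[] {
  const passing = cands.filter((c) =>
    c.onPanel
      ? c.contextRelevance >= 30
      : c.contextRelevance >= 40 && c.clinicalEvidence >= 15,
  );
  // Keep one record per canonical drug: the one with the highest repurposingScore.
  const best = new Map<string, Candidate>();
  for (const c of passing) {
    const prev = best.get(c.canonicalName);
    if (!prev || c.repurposingScore > prev.repurposingScore) best.set(c.canonicalName, c);
  }
  return Array.from(best.values());
}
```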

Validation

Convergence Scoring

A drug scoring highly on one factor but appearing in no other source is less trustworthy than a drug supported by multiple independent evidence types. Convergence scoring counts how many of six orthogonal source families each drug passes:

Family	Passes when
`pubmed`	≥ 1 PMID linked to the drug–disease combination
`clinicaltrials`	≥ 1 trial record in ClinicalTrials.gov for this drug
`fda`	FDA approved, phase ≥ 3, or phase label matches "approved / phase 3 / phase 4"
`chembl`	Confirmed binding targets or bioactivity records exist in ChEMBL
`structural`	AlphaFold pLDDT ≥ 50, PDB structures present, or `hasAlphaFold=true`
`network`	≥ 3 interaction neighbours in STRING/BioGRID, or `hasNetworkProximity=true`

A convergence score of 4/6 or higher is displayed as a "convergent" badge on the drug card. This badge means the drug's ranking is supported by multiple orthogonal evidence lines, not just a single strong signal. The six families are intentionally independent — structural data cannot influence the PubMed or clinical trial checks.

Convergence scoring is a display and communication tool, not a re-ranking signal. It does not alter the six-factor score. Its purpose is to help researchers quickly identify candidates with broad multi-source support.
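The pass rules in the table can be expressed as independent boolean checks. The `DrugEvidence` field names are hypothetical apart from `hasAlphaFold` and `hasNetworkProximity`, which appear in the table; the free-text phase-label match for the `fda` family is left out of this sketch.

```typescript
const FAMILIES = ["pubmed", "clinicaltrials", "fda", "chembl", "structural", "network"] as const;
type Family = (typeof FAMILIES)[number];

interface DrugEvidence {
  pmidCount: number;
  trialCount: number;
  fdaApproved: boolean;
  maxPhase: number;
  chemblRecords: number;
  plddt?: number;
  pdbStructures: number;
  hasAlphaFold?: boolean;
  networkNeighbours: number;
  hasNetworkProximity?: boolean;
}

function convergence(e: DrugEvidence) {
  // Each family is evaluated on its own inputs; no check reads another family's data.
  const checks: Record<Family, boolean> = {
    pubmed: e.pmidCount >= 1,
    clinicaltrials: e.trialCount >= 1,
    fda: e.fdaApproved || e.maxPhase >= 3,
    chembl: e.chemblRecords > 0,
    structural: (e.plddt ?? 0) >= 50 || e.pdbStructures > 0 || e.hasAlphaFold === true,
    network: e.networkNeighbours >= 3 || e.hasNetworkProximity === true,
  };
  const passed = FAMILIES.filter((f) => checks[f]);
  return { passed, score: passed.length, convergent: passed.length >= 4 };
}
```

The returned `score` is reported alongside the six-factor score and never fed back into it, matching the display-only role described above.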

AI Synthesis

6-Role AI Debate

GaiaLab routes each analysis through six structured AI roles, each receiving scored data rather than raw database text. Generative roles (Hypothesis, Evidence, Risk) run on DeepSeek V3; critical roles (Critic, Synthesis, Innovation) run on Anthropic Claude. Each role has a defined adversarial mandate, and the critique loop is iterative: the Hypothesis Agent revises its hypotheses in response to flaws raised by the Critic Agent.

Hypothesis Agent

Generates mechanistic hypotheses from gene-pathway-drug co-occurrence patterns. Revises hypotheses in response to Critic flaws (iterative debate round).

Critic Agent

Identifies confounders, alternative explanations, and evidence gaps. Flags hypotheses that lack direct mechanistic support. Seeded with live OpenTargets and ChEMBL bioactivity data.

Evidence Agent

Assesses citation quality, recency, and quantitative support from PMC full-text extraction (IC50, HR, OR, n= values). Assigns grounding scores per claim.

Risk Agent

Evaluates safety signals from FDA FAERS adverse event counts and contraindication overlaps. Penalises drug candidates with high AE burden in the disease population.

Innovation Agent

Identifies novel angles — repurposing opportunities, combination hypotheses, and underexplored targets. Seeded with active ClinicalTrials.gov recruiting trials.

Synthesis Agent

Integrates debate outputs into the executive brief. Applies the advisory-therapeutic normaliser to ensure claim-level confidence aligns with citation coverage. Produces the final PMID evidence ledger.

Provider assignment per role

Generative roles (Hypothesis, Evidence, Risk) are routed to DeepSeek V3. Critical/evaluative roles (Critic, Synthesis, Innovation) are routed to Anthropic Claude. This ensures the debate reflects genuinely different training biases — not just structural roles on the same model. If a provider is unavailable, calls fall back in order: DeepSeek → OpenAI → Google Gemini → Anthropic Claude.

When the environment variable GAIALAB_MULTI_AGENT_FORCE_MODEL is set, all six roles are collapsed to a single provider and per-role routing is disabled. This is the behaviour on deployments where only one AI key is configured. The Trust page's "AI Provider Health" section shows which providers are currently active for a given run.
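The routing rules can be sketched as a pure function. The provider identifiers and the `forced` parameter (standing in for the value of GAIALAB_MULTI_AGENT_FORCE_MODEL, whose format is not documented here) are assumptions for illustration.

```typescript
type Provider = "deepseek" | "openai" | "gemini" | "anthropic";
type Role = "hypothesis" | "critic" | "evidence" | "risk" | "innovation" | "synthesis";

const PREFERRED: Record<Role, Provider> = {
  hypothesis: "deepseek", evidence: "deepseek", risk: "deepseek",
  critic: "anthropic", synthesis: "anthropic", innovation: "anthropic",
};

// Fallback order as stated above.
const FALLBACK_ORDER: Provider[] = ["deepseek", "openai", "gemini", "anthropic"];

function pickProvider(
  role: Role,
  available: Set<Provider>,
  forced?: Provider, // hypothetical: the single provider named by the force-model setting
): Provider | null {
  if (forced) return forced; // per-role routing disabled
  const preferred = PREFERRED[role];
  if (available.has(preferred)) return preferred;
  // First available provider in the fallback order, or null if none remain.
  return FALLBACK_ORDER.find((p) => available.has(p)) ?? null;
}
```

A `null` result corresponds to the case where no provider is available and the analysis completes without AI synthesis.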

Trust

Confidence Tiers

Claim-level confidence is capped by citation coverage. AI-generated language cannot assert high confidence when the citation record does not support it.

Confidence	Requirement	Display
High	On-panel target AND clinical evidence score ≥ 15 AND ≥ 6 PubMed citations	Green border, "strong evidence" label
Medium	2–5 citations OR off-panel with clinical data	Blue border, "moderate evidence" label
Low	< 2 citations OR hypothesis only	Grey border, "exploratory" label

Every cited claim includes a PMID. Claims without PMIDs are labelled "derived" or "hypothetical" and rendered with reduced visual prominence. This is enforced by the PMID evidence ledger, not by AI instruction — AI cannot override it.
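A sketch of the tier rules as a deterministic function, which is what lets the ledger rather than the model decide confidence. The table leaves some overlaps open, so this example makes assumptions: tiers are evaluated in the order High, Medium, Low; a claim with 6 or more citations that misses the other High conditions falls to Medium; and "clinical data" means a clinical evidence score above zero.

```typescript
type Confidence = "high" | "medium" | "low";

interface ClaimEvidence {
  onPanel: boolean;
  clinicalEvidence: number; // clinical evidence score
  citations: number;        // count of PubMed citations with PMIDs
}

function confidenceTier(e: ClaimEvidence): Confidence {
  if (e.onPanel && e.clinicalEvidence >= 15 && e.citations >= 6) return "high";
  if (e.citations >= 2 || (!e.onPanel && e.clinicalEvidence > 0)) return "medium";
  return "low";
}
```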

Integration

MCP Server Interface

GaiaLab exposes a Model Context Protocol (MCP) server at POST /mcp, allowing AI assistants — including Claude Desktop and custom agents built with the Anthropic Agent SDK — to call the full analysis pipeline as a tool.

Tool: `gaialab_generate_insights`

Input (Zod-validated JSON):
{
  "genes": string[]          // HGNC gene symbols, e.g. ["BRCA1", "TP53", "PTEN"]
  "diseaseContext": string   // e.g. "triple-negative breast cancer"
  "audience": string         // "researcher" | "clinician" | "general"
}
Output: full structured analysis JSON including
- pathway enrichment with FDR q-values
- drug candidates with tier, score breakdown, convergence score
- hypotheses with PMID citations and evidence status
- executive brief
- analysis ID for citation

Each POST creates a fresh server transport instance. Responses carry Access-Control-Allow-Origin: *. The MCP interface is the primary integration surface for embedding GaiaLab into research workflow automation.

Researcher and Enterprise tier API keys bypass the IP-based daily quota gate. Free-tier users accessing the MCP endpoint are subject to the same daily limit as the web interface.
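A sketch of the request body a client might POST to /mcp. MCP tool invocations are JSON-RPC 2.0 messages with method `tools/call`; the helper name `buildToolCall` and the empty-panel check are illustrative. Whether the GaiaLab endpoint expects a prior initialize handshake, particular headers, or an API key header is not stated in this document, so none of that is shown.

```typescript
type Audience = "researcher" | "clinician" | "general";

interface InsightsInput {
  genes: string[];
  diseaseContext: string;
  audience: Audience;
}

// Wrap the tool arguments in a JSON-RPC 2.0 tools/call envelope.
function buildToolCall(input: InsightsInput, id = 1) {
  if (input.genes.length === 0) throw new Error("genes must not be empty");
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call",
    params: { name: "gaialab_generate_insights", arguments: input },
  };
}

const body = JSON.stringify(
  buildToolCall({
    genes: ["BRCA1", "TP53", "PTEN"],
    diseaseContext: "triple-negative breast cancer",
    audience: "researcher",
  }),
);
```

The resulting `body` string is what would be sent as the POST payload with a JSON content type.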

Calibration

Prospective Prediction Tracking

Every drug repurposing prediction made by GaiaLab is recorded at analysis time with a confidence score, the disease context, and the date. The prediction tracker periodically polls ClinicalTrials.gov v2 to check whether a trial for that drug–disease pair has completed, and if so, what outcome was reported.

Outcome labelling

Outcome	Score
Trial completed with positive result	1.0
Trial completed with mixed result	0.5
Trial completed, neutral / inconclusive	0.25
Trial terminated, no trial found, or negative	0.0

Retrospective AUROC benchmark (2026-03 snapshot, N = 529, 22 disease areas)

AUROC: 0.545 (95% CI bootstrap: 0.526–0.562) vs 0.50 random baseline
  Modest but consistent signal above chance.
Brier Score: live at /api/predictions/calibration (vs 0.25 no-skill baseline; lower is better)
  Formula: (1/N) Σ (predicted_confidence − outcome)²
  Below 0.25 beats no-skill, BUT must be read with ECE: a low Brier under high ECE reflects class imbalance, not calibrated confidence.
Concordance: 77% (fraction of predictions with a matching ClinicalTrials.gov entry)

AUROC 0.545 is not a clinically validated predictor. It represents a modest, real signal above random (0.50) on a historical snapshot. GaiaLab is a hypothesis-generation tool — these metrics show the predictions are not random, but cannot predict which specific drug will succeed in any given trial.
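The Brier formula combined with the outcome labels above can be sketched as follows. The `Prediction` shape is hypothetical, confidence is assumed to be on a 0 to 1 scale, and the "negative" label here covers terminated, not-found and negative trials, which all score 0.0 in the table.

```typescript
type Outcome = "positive" | "mixed" | "neutral" | "negative";

const OUTCOME_SCORE: Record<Outcome, number> = {
  positive: 1.0, mixed: 0.5, neutral: 0.25, negative: 0.0,
};

interface Prediction {
  confidence: number; // assumed to lie in 0..1
  outcome: Outcome;
}

// Mean squared difference between predicted confidence and the labelled outcome score.
function brierScore(preds: Prediction[]): number {
  if (preds.length === 0) return NaN;
  const sum = preds.reduce(
    (acc, p) => acc + (p.confidence - OUTCOME_SCORE[p.outcome]) ** 2,
    0,
  );
  return sum / preds.length;
}
```

The 0.25 no-skill baseline is what a constant confidence of 0.5 scores against purely binary (0 or 1) outcomes, since every squared error is then 0.25.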

Full calibration curve: GET /api/predictions/calibration. Individual predictions: GET /api/predictions. Both endpoints are public and unauthenticated.

Reproducibility

Immutable Analysis IDs

Every analysis run generates a permanent ID of the form gl-{timestamp}-{8-char-hash}. This ID is:

Included in API responses and the analysis UI
Linkable as a permanent URL: https://gailabai.com/analysis/{id}
Safe to cite in paper supplementary materials
Stored as an immutable JSON snapshot in data/snapshots/
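A loose parser for the ID form given above. The exact timestamp format and hash alphabet are not specified, so this sketch assumes only the literal `gl-` prefix and a final 8-character segment containing no hyphen; the example ID in the usage note is made up.

```typescript
// Split an analysis ID into its timestamp and hash parts, or return null if malformed.
function parseAnalysisId(id: string): { timestamp: string; hash: string } | null {
  const match = /^gl-(.+)-([^-]{8})$/.exec(id);
  return match ? { timestamp: match[1], hash: match[2] } : null;
}

// Permanent URL in the form given in the list above.
const permalink = (id: string): string => `https://gailabai.com/analysis/${id}`;
```

For a hypothetical ID such as `gl-1714000000000-a1b2c3d4`, the parser returns the middle segment as the timestamp and `a1b2c3d4` as the hash.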

Snapshot files record the exact gene list, disease context, all database responses, all scored outputs, and the AI synthesis. A snapshot can be replayed to verify that the same inputs produce equivalent outputs under the same database state.

Analysis IDs do not guarantee database state reproducibility — external databases update over time. For full reproducibility, include the analysis ID AND the snapshot file in supplementary materials.

Prediction accuracy

GaiaLab prospectively records drug repurposing predictions and cross-references them against ClinicalTrials.gov outcomes. Live calibration metrics are published at /api/predictions/calibration — read the Brier and ECE together (a low Brier under high ECE reflects class imbalance, not calibrated confidence):

Brier Score: live (vs 0.25 no-skill baseline; lower is better)
  Formula: (1/N) Σ (predicted_confidence − outcome)²
  outcome = 1.0 if trial matched (positive), 0.5 if mixed, 0.25 if neutral, 0.0 if negative or no trial found
Concordance: 77% (candidates with a matching ClinicalTrials.gov entry)
AUROC: 0.545 (0.50 = random baseline)

A Brier Score below 0.25 beats the no-skill baseline, but on its own it does not establish calibrated confidence; read it alongside ECE as noted above. Full calibration curve: GET /api/predictions/calibration.

Limitations

Known Limitations

Database coverage gaps

Without paid API keys (DisGeNET, DrugBank), active coverage falls to approximately 60 of the 75+ integrated databases. These gaps are disclosed in the analysis output and do not produce false confidence: missing sources are simply absent, not filled with hallucinated data.

AI synthesis is probabilistic

The six AI agents reason from structured data but can still produce plausible-sounding errors. All AI output is gated by the PMID evidence ledger — claims without citation support are demoted. Users should treat the executive brief as a hypothesis generator, not a clinical decision tool.

Small panels (< 3 genes)

Pathway enrichment and drug scoring are less reliable with fewer than 3 genes. The hypergeometric test loses power and synthetic lethality detection is disabled. Results for single-gene queries are labelled accordingly.

Non-human species

GaiaLab is optimised for human gene symbols. Mouse orthologs (e.g. Trp53) are partially supported via alias resolution but may miss sources that do not cross-reference species.

Not a clinical decision support tool

GaiaLab is a research intelligence platform. Outputs are not validated for clinical use and should not inform patient treatment decisions without independent expert review. Independent regulatory validation is required before any therapeutic or clinical application.

Evidence grounding variability

The grounded ratio — the fraction of insight items backed by at least one validated PMID — ranges from 28% (cold start, PubMed rate-limited) to 70%+ (warm literature cache, full paper pool). Cold-start runs occur after server restart when the 5-minute literature cache is empty; a second run on the same gene panel will consistently score higher. The grounded ratio is reported on every analysis output. When it falls below 15%, the system suppresses speculative claims and labels the analysis as conservatively synthesised.
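The grounded ratio and the suppression rule reduce to a short calculation. The `Insight` shape and the function names are illustrative; returning 0 for an empty analysis is an assumption.

```typescript
interface Insight {
  text: string;
  validatedPmids: string[]; // PMIDs that passed the existence check
}

// Fraction of insight items backed by at least one validated PMID.
function groundedRatio(items: Insight[]): number {
  if (items.length === 0) return 0;
  return items.filter((i) => i.validatedPmids.length > 0).length / items.length;
}

// Below 15%, speculative claims are suppressed and the run is labelled conservative.
const suppressSpeculative = (ratio: number): boolean => ratio < 0.15;
```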

AI synthesis provider availability

GaiaLab's 6-role AI debate depends on API quota from external providers (DeepSeek, Anthropic, OpenAI, Google). When all providers reach daily quota limits simultaneously, analyses complete using database-structured outputs only, without AI synthesis. The executive brief section is labelled "quota-limited synthesis" in these cases. Provider quota status is visible at /api/health.

¹ Active source count varies with API key configuration. Full source list in Section 3 — Data Sources. Without optional paid keys (DisGeNET, DrugBank), active coverage is approximately 60 sources. The count of 75+ reflects the full set of integrated clients shipped with the platform.
