Live Evaluation

Benchmark Dashboard

Real-time scientific quality scores updated after every analysis run. All metrics are computed from live data — no cherry-picking.

Live Scores

Scientific Quality Metrics

Four core quality signals measured continuously across every analysis. Green = meets bar, amber = marginal, red = below threshold.
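As a sketch of how this traffic-light banding can work (the threshold values below are illustrative assumptions, not the dashboard's actual cutoffs):

```javascript
// Map a 0-1 metric score to a traffic-light band.
// Thresholds are hypothetical examples, not GaiaLab's real cutoffs.
// `higherIsBetter: false` covers metrics like hallucination rate,
// where a low raw score is the good outcome.
function band(score, { meets = 0.9, marginal = 0.75, higherIsBetter = true } = {}) {
  const s = higherIsBetter ? score : 1 - score;
  if (s >= meets) return "green";
  if (s >= marginal) return "amber";
  return "red";
}

console.log(band(0.93));                            // green
console.log(band(0.8));                             // amber
console.log(band(0.04, { higherIsBetter: false })); // green (0.96 after inversion)
```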

Citation Coverage: % of insights with ≥1 PMID
Grounded Ratio: % of claims verified by NLI
Cite-F1: harmonic mean of citation precision and recall
PMID Hallucination: % of fabricated citations (lower is better)
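To make the four definitions concrete, here is a toy computation over a handful of insights. The data shapes, PMIDs, and the assumed recall value are illustrative only, not GaiaLab's internal format:

```javascript
// Toy computation of the four quality metrics.
// Data shapes below are hypothetical, not GaiaLab's real schema.
const insights = [
  { claim: "A", pmids: ["12345678"], verifiedByNLI: true },
  { claim: "B", pmids: [], verifiedByNLI: false },
  { claim: "C", pmids: ["23456789", "99999999"], verifiedByNLI: true },
];
const knownPmids = new Set(["12345678", "23456789"]); // "99999999" is fabricated

// Citation coverage: share of insights carrying at least one PMID.
const coverage = insights.filter(i => i.pmids.length > 0).length / insights.length;

// Grounded ratio: share of claims the NLI checker verified.
const grounded = insights.filter(i => i.verifiedByNLI).length / insights.length;

// Cite-F1: harmonic mean of citation precision and recall.
// Precision here = cited PMIDs that resolve to real records; recall would
// normally be measured against a gold citation set -- assumed 2/3 for this sketch.
const cited = insights.flatMap(i => i.pmids);
const precision = cited.filter(p => knownPmids.has(p)).length / cited.length;
const recall = 2 / 3; // assumed for the sketch
const citeF1 = (2 * precision * recall) / (precision + recall);

// PMID hallucination rate: share of cited PMIDs that don't resolve.
const hallucination = cited.filter(p => !knownPmids.has(p)).length / cited.length;

console.log({ coverage, grounded, citeF1, hallucination });
```

With this toy data, coverage and grounded ratio both come out to 2/3 and one of three cited PMIDs is fabricated.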
Historical Trend

Quality Over Time

Last 30 analysis runs. Each point represents one analysis session.

Score Trend

Citation coverage, grounded ratio, and confidence index across recent runs
Analysis Volume

Platform Activity

Cumulative counts across all GaiaLab analysis sessions.

Total Analyses
Saved Snapshots
35+ Data Sources
8 MCP Tools
Reproducibility

Run Your Own Evaluation

All benchmarks are reproducible. Clone the repo and run the gold standard eval in under 5 minutes.

# Clone and install
git clone https://github.com/gaialab/gaialab-app
cd gaialab-app && npm install

# Run gold standard benchmark (requires ANTHROPIC_API_KEY or OPENAI_API_KEY)
npm run eval:gold

# Run trust & reliability benchmarks
npm run eval:trust

# Generate HTML dashboard of results
npm run eval:dashboard

# Or run a single eval file directly:
node scripts/gaialab-eval.js --benchmark=gold --bypassCache=true