Frontier AI Models Outperform Specialized Clinical Tools on All Three Benchmarks, With Implications Beyond Medicine

A peer-reviewed study published in Nature Medicine finds that general-purpose frontier AI models — GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 — consistently outperform specialized clinical AI tools in medical settings. The findings carry a broader implication for any organization evaluating whether to buy purpose-built AI tools or rely on frontier models.
Researchers from NYU Langone Health ran three evaluations: 500 US Medical Licensing Examination-style questions (MedQA), 500 clinician-alignment items (HealthBench), and 100 real physician queries (RCQ) drawn from live clinical deployments. The clinical tools evaluated were OpenEvidence and UpToDate Expert AI — both built on large language models and designed specifically for medical use.
Frontier models came out ahead in all three evaluations. On MedQA, Gemini scored 97.4%, GPT 94.2%, and Claude 90.2%, compared with 89.6% for OpenEvidence and 88.4% for UpToDate. On HealthBench, GPT scored 88.0, versus 62.6 and 61.3 for the two clinical tools. On real physician queries, the clinical tools had 49–87% lower odds than Gemini of receiving a higher clinician rating. Google Search AI Overview matched, but did not exceed, the clinical AI tools in the real-world query evaluation.
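To make these figures more concrete, here is a minimal Python sketch that converts the reported MedQA percentages into approximate counts of correct answers and restates the "49–87% lower odds" range as odds ratios. It rests on two assumptions not spelled out in this summary: that each of the 500 MedQA questions counts equally, and that "X% lower odds" means an odds ratio of 1 - X/100 relative to Gemini. The function and variable names are illustrative only and do not come from the study's code.

```python
# Reported MedQA accuracies (percent) as quoted in the article.
MEDQA_QUESTIONS = 500  # number of MedQA items reported in the article

medqa_accuracy = {
    "Gemini": 97.4,
    "GPT": 94.2,
    "Claude": 90.2,
    "OpenEvidence": 89.6,
    "UpToDate": 88.4,
}


def correct_count(accuracy_pct, total=MEDQA_QUESTIONS):
    """Approximate number of correct answers, assuming equal weighting."""
    return round(accuracy_pct * total / 100)


def odds_ratio_from_pct_lower(pct_lower):
    """Assumed reading: 'X% lower odds' corresponds to an odds ratio of 1 - X/100."""
    return round(1 - pct_lower / 100, 2)


medqa_correct = {name: correct_count(acc) for name, acc in medqa_accuracy.items()}

# Gap, in questions, between the best frontier model and each clinical tool.
gap_vs_gemini = {
    name: medqa_correct["Gemini"] - medqa_correct[name]
    for name in ("OpenEvidence", "UpToDate")
}

# The reported 49-87% lower odds, expressed as odds ratios versus Gemini.
odds_ratio_range = (odds_ratio_from_pct_lower(49), odds_ratio_from_pct_lower(87))

if __name__ == "__main__":
    for name, count in medqa_correct.items():
        print(f"{name}: about {count} of {MEDQA_QUESTIONS} correct")
    print("Gap vs. Gemini (questions):", gap_vs_gemini)
    print("Odds ratios vs. Gemini:", odds_ratio_range)
```

Under these assumptions, the MedQA gap between Gemini and the clinical tools comes to roughly 40 to 45 questions out of 500, and the real-world query result corresponds to odds ratios of about 0.51 to 0.13.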
Key Takeaways
- Specialized AI tools did not outperform frontier models on medical knowledge, expert clinical alignment, or real-world physician queries.
- Scale and alignment may outweigh domain-specific tuning for tasks that primarily involve knowledge retrieval and reasoning.
- Procurement and regulatory implications: the authors call for independent evaluation of AI tools before clinical adoption — a principle that applies to AI procurement in any sector.
The study is open access and the code is publicly available at github.com/nyuolab/clinical-llm-benchmarks.
Read the full article on Nature Medicine
