Open-source AI improved doctors' rare-disease diagnoses in trial — but increased high-confidence errors

By ChatGPT — AI-generated · Published:

A new preprint reports that an open-source artificial intelligence model built for rare disease diagnosis improved physicians’ diagnostic accuracy by 21.44 percentage points in a randomized assistance trial compared with internet search alone.

But the study was an early, case-based test rather than a real-world clinical deployment, and it also found a potential safety concern: When AI-assisted physicians were wrong, they were more likely to be wrong with high confidence.

The paper, posted to arXiv on June 23, is titled “A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial.” It comes from a multi-institutional team including researchers affiliated with Tsinghua University and Tsinghua Medicine, Peking Union Medical College Hospital, and other centers in China and Singapore.

The model, called RaDaR, short for Rare Disease navigatoR, is described as a 32-billion-parameter reasoning large language model specialized for rare-disease diagnosis. According to the paper, it was trained on 49,170 publicly available free-text cases and 104,666 synthetic cases, for about 154,000 cases in total.

Rare diseases collectively affect about 400 million people worldwide, and the paper cites an average diagnostic delay of about 6.2 years. That long delay helps explain the appeal of tools designed to support earlier recognition of uncommon conditions. What makes this study stand out is not just a benchmark score but a randomized trial testing whether the model actually improved the diagnostic lists doctors produced.

“In a randomized physician-assistance trial, RaDaR assistance improved physicians’ rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone (P < 0.0001),” the paper’s abstract says.

The trial was prospectively registered in the Chinese Clinical Trial Registry as ChiCTR2500115619 and received ethics approval from the Tsinghua University Ethics Committee. Physicians were recruited from Jan. 1 to Jan. 5, 2026, then stratified by specialty and years of experience before randomization.

A total of 84 physicians from 28 hospitals were enrolled, and 76 completed the trial. Among those who completed it, 40 were assigned to the control group and 36 to the RaDaR-assisted group. The median clinical experience among completers was 4 years, with an interquartile range of 2 to 9 years.

The study compared two approaches. The control group used internet search alone. The intervention group used internet search plus RaDaR. The primary outcome was physician-level diagnostic accuracy, defined as the share of five assigned cases in which the physician included the final diagnosis somewhere in the differential diagnosis list.
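To make the primary outcome concrete, here is a minimal Python sketch of how a physician-level score of this kind could be computed. It is an illustration, not the authors' scoring code: the function name `physician_accuracy`, the exact-match rule after lowercasing, and the five example cases are all hypothetical, and the article does not describe how the paper matched a listed diagnosis to the final one.

```python
from typing import Sequence


def physician_accuracy(
    differentials: Sequence[Sequence[str]],
    final_diagnoses: Sequence[str],
) -> float:
    """Share of cases whose final diagnosis appears anywhere in the differential list."""
    if len(differentials) != len(final_diagnoses):
        raise ValueError("need exactly one differential list per case")
    if not final_diagnoses:
        raise ValueError("need at least one case")

    hits = 0
    for ddx, truth in zip(differentials, final_diagnoses):
        # Assumption: a hit is an exact match after trimming and lowercasing.
        # The matching rule actually used in the paper may differ.
        normalized = {entry.strip().lower() for entry in ddx}
        if truth.strip().lower() in normalized:
            hits += 1
    return hits / len(final_diagnoses)


# Hypothetical physician with five assigned cases (made-up example data).
example_differentials = [
    ["Fabry disease", "Lupus nephritis"],
    ["Multiple sclerosis"],
    ["Pompe disease", "Limb-girdle muscular dystrophy"],
    ["Sarcoidosis", "Tuberculosis"],
    ["Wilson disease"],
]
example_truth = [
    "Fabry disease",
    "Neuromyelitis optica",
    "Pompe disease",
    "Erdheim-Chester disease",
    "Wilson disease",
]

example_score = physician_accuracy(example_differentials, example_truth)
print(f"{example_score:.0%}")  # prints "60%": 3 of the 5 lists contain the final diagnosis
```

Under this definition the position of the diagnosis in the list does not matter, so a physician gets credit whether the correct answer is ranked first or last.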

On that measure, the RaDaR-assisted physicians scored 49.44% on average, compared with 28.00% in the control arm. The absolute difference was 21.44 percentage points, with a 95% confidence interval of 11.20 to 31.68. The paper reported a P value of less than 0.0001. Diagnostic time per case did not differ significantly between the two groups, so the accuracy gain came with no measured change in speed, faster or slower.
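The reported figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in this article; the implied case counts at the end are a back-of-envelope inference that assumes every completer answered all five assigned cases, which the article does not state.

```python
# Figures as reported in the article.
assisted_mean = 49.44            # percent, RaDaR-assisted arm
control_mean = 28.00             # percent, control arm
ci_low, ci_high = 11.20, 31.68   # 95% confidence interval, percentage points

# The difference should match the headline effect, and the interval
# should be centered on it if it is a symmetric interval.
difference = round(assisted_mean - control_mean, 2)
ci_midpoint = round((ci_low + ci_high) / 2, 2)
ci_half_width = round((ci_high - ci_low) / 2, 2)

# Assumption (not stated in the article): each completer answered all five
# assigned cases, so the arm-level mean equals the pooled share of hits.
CASES_PER_PHYSICIAN = 5
assisted_responses = 36 * CASES_PER_PHYSICIAN
control_responses = 40 * CASES_PER_PHYSICIAN
implied_assisted_hits = round(assisted_mean / 100 * assisted_responses)
implied_control_hits = round(control_mean / 100 * control_responses)

print(difference, ci_midpoint, ci_half_width)
print(implied_assisted_hits, "of", assisted_responses)
print(implied_control_hits, "of", control_responses)
```

If that assumption holds, the averages correspond to roughly 89 correct differential lists out of 180 in the assisted arm and 56 out of 200 in the control arm, which is consistent with the percentages as reported.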

The trial, however, did not test the system on live patients. It used 50 confirmed rare-disease cases from an in-house dataset at Peking Union Medical College Hospital. The case narratives were truncated before explicit diagnosis and definitive tests to simulate the information available earlier in the diagnostic process.

The paper also reported a mixed picture on physician confidence. Participants in the RaDaR arm rated their assistance as more helpful, with a self-reported helpfulness score of 4.19 versus 3.36 in the control arm. At the same time, the paper reported worse calibration in the assisted arm, 0.48 versus 0.39. In plain terms, assisted physicians’ confidence was less well matched to whether they were actually correct.

That concern showed up more sharply in the error analysis. Among incorrectly diagnosed cases, high-confidence incorrect responses were more common with RaDaR assistance than with internet search alone: 65.9% versus 27.1%. The paper flags that pattern as a potential automation-bias risk, a sign that the AI may sometimes make clinicians more certain even when it is steering them wrong.
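The error-analysis figure is a conditional share: of the responses that were wrong, what fraction were given with high confidence. A rough Python sketch of that calculation follows. The function name, the 1-to-5 confidence scale, the cutoff of 4 and the toy responses are all assumptions made for illustration; the article does not say how the paper defined "high confidence."

```python
from typing import Iterable, Tuple


def high_confidence_error_share(
    responses: Iterable[Tuple[bool, int]],
    high_confidence_threshold: int = 4,
) -> float:
    """Among incorrect responses, the fraction rated at or above the threshold.

    Each response is (is_correct, confidence). The confidence scale and the
    default threshold are hypothetical, not taken from the paper.
    """
    incorrect_confidences = [conf for is_correct, conf in responses if not is_correct]
    if not incorrect_confidences:
        raise ValueError("no incorrect responses to analyze")
    high = sum(1 for conf in incorrect_confidences if conf >= high_confidence_threshold)
    return high / len(incorrect_confidences)


# Made-up responses: (was the diagnosis correct?, self-rated confidence 1-5).
toy_responses = [(True, 5), (False, 5), (False, 2), (True, 3), (False, 4), (False, 1)]
toy_share = high_confidence_error_share(toy_responses)
print(f"{toy_share:.1%}")  # prints "50.0%": 2 of the 4 wrong answers were high-confidence
```

Because the denominator is only the incorrect responses, the share can rise even while overall accuracy improves, which is the pattern the paper reports.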

Beyond the randomized trial, the authors reported a retrospective multi-visit cohort analysis in which RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06% of cases, corresponding to a reported median potential lead time of 1.87 months. They also said RaDaR outperformed the other open-source models they evaluated, including DeepSeek-R1, across their reported benchmarks and at four external validation centers.

For now, though, the main takeaway is narrower. This is an early preprint describing a physician-assistance experiment using curated rare-disease cases, not proof of better patient outcomes in routine care. The open-science pitch is unusually strong: The paper says model weights and datasets are available on Hugging Face, and the training and synthesis code is on GitHub. The disclosure section says Tien Yin Wong reported consulting fees from several pharmaceutical companies and startup and patent interests, while other authors reported no competing interests.

Tags: #ai, #raredisease, #diagnosis, #medicalai