Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 28, 2025
Open Peer Review Period: Feb 28, 2025 - Apr 25, 2025
Date Accepted: Apr 21, 2025

The final, peer-reviewed published version of this preprint can be found here:

Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models

Zhong W, Sun M, Yao S, Liu Y, Peng D, Liu Y, Yang K, Gao H, Yan H, Hao W, Yan Y, Yin C

Enhancing the Accuracy of Human Phenotype Ontology Identification: Comparative Evaluation of Multimodal Large Language Models

J Med Internet Res 2025;27:e73233

DOI: 10.2196/73233

PMID: 40456109

PMCID: 12148245

Enhancing the Accuracy of Human Phenotype Ontology Identification: A Comparative Evaluation of Multimodal Large Language Models

  • Wei Zhong
  • Mingyue Sun
  • Shun Yao
  • YiFan Liu
  • Dingchuan Peng
  • Yan Liu
  • Kai Yang
  • HuiMin Gao
  • HuiHui Yan
  • WenJing Hao
  • YouSheng Yan
  • ChengHong Yin

ABSTRACT

Background:

Identifying Human Phenotype Ontology (HPO) terms is crucial for diagnosing and managing rare diseases. However, clinicians, especially junior physicians, often face challenges due to the complexity of describing patient phenotypes accurately. Traditional manual search methods using HPO databases are time-consuming and prone to errors.

Objective:

To investigate whether the use of multimodal large language models (MLLMs) can improve the accuracy of junior physicians in identifying HPO terms from patient images related to rare diseases.

Methods:

Twenty junior physicians from 10 specialties participated. Each physician evaluated 27 patient images sourced from publicly available literature, with phenotypes relevant to rare diseases listed in the Chinese Rare Disease Catalogue. Participants were divided into two groups: the manual search group relied on the Chinese Human Phenotype Ontology (CHPO) website, while the MLLM-assisted group used an electronic questionnaire that included HPO terms pre-identified by ChatGPT-4o as prompts, followed by a search using the CHPO. The primary outcome was the accuracy of HPO identification, defined as the proportion of correctly identified HPO terms compared to a standard set determined by an expert panel. Additionally, the accuracy of outputs from ChatGPT-4o and two open-source MLLMs (Llama3.2:11b and Llama3.2:90b) was evaluated using the same criteria, with hallucinations for each model documented separately. Furthermore, participating physicians completed an additional electronic questionnaire regarding their rare disease background to identify factors affecting their ability to accurately describe patient images using standardized HPO terms.

Results:

A total of 270 descriptions were evaluated per group. The MLLM-assisted group achieved a significantly higher accuracy rate of 67.41% compared to 20.37% in the manual group (RR = 3.31, 95% CI: 2.58–4.25, P < .001). The MLLM-assisted group demonstrated consistent performance across departments, whereas the manual group exhibited greater variability. Among standalone MLLMs, ChatGPT-4o achieved an accuracy of 48.15%, while the open-source models Llama3.2:11b and Llama3.2:90b achieved 14.81% and 18.52%, respectively. However, MLLMs exhibited a high hallucination rate, frequently generating HPO terms with incorrect IDs or entirely fabricated content. Specifically, ChatGPT-4o, Llama3.2:11b, and Llama3.2:90b generated incorrect IDs in 57.26% (67/117), 98.41% (62/63), and 82.14% (46/56) of cases, respectively, and fabricated terms in 34.18% (40/117), 41.27% (26/63), and 32.14% (18/56) of cases, respectively. Additionally, a survey on the rare disease knowledge of junior physicians suggests that participation in rare disease and genetic disease training may enhance the performance of some physicians.
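The reported risk ratio and its confidence interval can be checked from the group accuracies. The sketch below is illustrative only: the abstract does not state which interval method the authors used, so it assumes the common Wald interval on the log scale, and the counts 182/270 and 55/270 are inferred from the reported 67.41% and 20.37% rather than given in the text. The function name `risk_ratio_ci` is hypothetical.

```python
import math


def risk_ratio_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio of group A vs. group B with a Wald CI on the log scale.

    Assumption: this is a standard textbook method, not necessarily the
    one used by the study authors.
    """
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent binomial proportions
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Counts inferred from the reported percentages (an assumption):
# 182/270 = 67.41% (MLLM-assisted), 55/270 = 20.37% (manual search)
rr, lower, upper = risk_ratio_ci(182, 270, 55, 270)
print(f"RR = {rr:.2f}, 95% CI: {lower:.2f}-{upper:.2f}")
# prints "RR = 3.31, 95% CI: 2.58-4.25"
```

Under these assumptions the output agrees with the values reported above, which suggests the published interval is consistent with a log-scale Wald calculation on 270 descriptions per group.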

Conclusions:

The integration of MLLMs into clinical workflows significantly enhances the accuracy of HPO identification by junior physicians, offering promising potential to improve the diagnosis of rare diseases and standardize phenotype descriptions in medical research. However, the notable hallucination rate observed in MLLMs underscores the necessity for further refinement and rigorous validation before widespread adoption in clinical practice.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.