Currently submitted to: Transfer Hub (manuscript eXchange)

Date Submitted: Oct 10, 2025
Open Peer Review Period: Oct 13, 2025 - Dec 8, 2025

NOTE: This is an unreviewed Preprint

Warning: This is an unreviewed preprint. Readers are warned that the document has not been peer-reviewed by expert/patient reviewers or an academic editor, may contain misleading claims, and is likely to undergo changes before final publication, if accepted, or may have been rejected/withdrawn (a note "no longer under consideration" will appear above).

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study

  • Jin Liao; 
  • Wenjun He; 
  • Huiyi Pan; 
  • Lanping Zhang; 
  • Xingyan Li; 
  • Jiamin Huang; 
  • Zhichao Liu; 
  • Xue Ke; 
  • Jian Li; 
  • Xue Li; 
  • Candice Hwang; 
  • Haiting Cai; 
  • Guobao Li; 
  • Jinghui Chang

ABSTRACT

Background:

Tuberculosis (TB) remains a major global health challenge, particularly in low- and middle-income countries, where effective triage, diagnosis, and management are often limited. Existing decision-support tools focus on imaging and cannot integrate multi-modal clinical information, constraining their utility in complex clinical scenarios. Large Language Models (LLMs) have shown promise in assisting diagnosis and clinical decision-making in other medical fields, but evidence for their application in TB care is scarce. Evaluating LLMs for TB decision support is crucial to explore their potential to improve clinical accuracy, efficiency, and quality of care in high-burden, resource-limited settings.

Objective:

To evaluate whether large language models (LLMs) can assist tuberculosis (TB) physicians in clinical decision-making across triage, differential diagnosis, and management recommendation tasks, addressing potential delays and inequities in TB care.

Methods:

In this experimental comparative study conducted in 2025 under STARD guidelines, 17 standardized TB cases (7 simulated, 10 real) were assessed. Responses were generated by two advanced LLMs (ChatGPT-4o and DeepSeek-R1) and two TB physicians. Reference standards were established by three TB specialists. Objective performance was measured using precision, recall, and F1 scores. Subjective evaluation assessed suitability, information quality, and, for management tasks, safety, conciseness, understandability, and operability using 5-point Likert scales. Readability was measured by a Chinese R-value; group differences were analyzed using Mann-Whitney U tests.
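As a rough illustration of the objective metrics named above, the following minimal Python sketch scores one response against a reference standard. It assumes each response and reference can be reduced to a set of discrete item labels; the abstract does not describe how items were matched, so the function name `precision_recall_f1` and the example items are hypothetical.

```python
def precision_recall_f1(response_items, reference_items):
    """Score a response against a reference standard (set-based matching).

    Assumption: items are compared by exact label; the study's actual
    matching procedure is not described in the abstract.
    """
    response = set(response_items)
    reference = set(reference_items)
    true_positives = len(response & reference)

    # Precision: share of the response's items that appear in the reference.
    precision = true_positives / len(response) if response else 0.0
    # Recall: share of the reference's items that the response covered.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example items, not taken from the study.
reference = {"sputum smear", "chest X-ray", "Xpert MTB/RIF",
             "HIV test", "drug susceptibility test"}
response = {"sputum smear", "chest X-ray", "CT scan"}

precision, recall, f1 = precision_recall_f1(response, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the response matches 2 of its 3 items (precision about 0.67) but covers only 2 of 5 reference items (recall 0.40), which shows how a short answer can be precise yet incomplete.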

Results:

Across all tasks, LLMs achieved precision similar to that of physicians (median 0.67 vs 0.50; U = 8695.5; P = .35) but higher recall (0.53 vs 0.33; U = 6848.5; P < .001) and F1 scores (0.58 vs 0.33; U = 7085.5; P < .001). In management tasks, LLMs outperformed physicians in recall (0.50 vs 0.20; U = 185.0; P < .001) and F1 (0.50 vs 0.30; U = 104.0; P < .001), with no difference in precision. Subjectively, LLMs scored higher in suitability (3.67 vs 3.00; U = 1122.0; P < .001), information quality (3.33 vs 2.67; U = 155.0; P < .001), understandability (3.67 vs 3.00; U = 4281.5; P = .022), and operability (3.67 vs 3.00; U = 4305.0; P = .025). No differences were observed in conciseness (P = .54) or safety (P = .06). Physicians’ responses were more readable (1.88 vs 2.17; U = 11427.5; P < .001).
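For readers unfamiliar with the U values reported here, the sketch below computes the Mann-Whitney U statistic for two groups of scores directly from its pairwise definition. It is illustrative only: the function name `mann_whitney_u` and the sample scores are hypothetical, the P value is not computed, and the abstract does not state which of the two U values the authors report. In practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would typically be used.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a exceeds b, with ties
    counted as one half. U_a + U_b always equals len(a) * len(b).
    """
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical 5-point Likert ratings, not taken from the study.
llm_scores = [4, 4, 3, 5]
physician_scores = [3, 2, 3, 4]

u_llm, u_physician = mann_whitney_u(llm_scores, physician_scores)
print(u_llm, u_physician)
```

A U value near half the product of the two group sizes indicates heavily overlapping distributions, while a value near zero or near the full product indicates that one group's scores tend to exceed the other's.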

Conclusions:

LLMs can serve as adjuncts to support TB clinical decision-making, enhancing management recommendations without replacing physicians. Their use may improve decision efficiency and help reduce disparities in TB care.

Clinical Trial:

This experimental comparative study evaluating large language models versus tuberculosis physicians did not involve patient interventions or randomization, and therefore was not registered as a clinical trial.


 Citation

Please cite as:

Liao J, He W, Pan H, Zhang L, Li X, Huang J, Liu Z, Ke X, Li J, Li X, Hwang C, Cai H, Li G, Chang J

Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study

JMIR Preprints. 10/10/2025:85613

DOI: 10.2196/preprints.85613

URL: https://preprints.jmir.org/preprint/85613

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.