JMIR Preprints #85613: Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study

Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study

Jin Liao;
Wenjun He;
Huiyi Pan;
Lanping Zhang;
Xingyan Li;
Jiamin Huang;
Zhichao Liu;
Xue Ke;
Jian Li;
Xue Li;
Candice Hwang;
Haiting Cai;
Guobao Li;
Jinghui Chang

ABSTRACT

Background:

Tuberculosis (TB) remains a major global health challenge, particularly in low- and middle-income countries, where effective triage, diagnosis, and management are often limited. Existing decision-support tools focus on imaging and cannot integrate multi-modal clinical information, constraining their utility in complex clinical scenarios. Large Language Models (LLMs) have shown promise in assisting diagnosis and clinical decision-making in other medical fields, but evidence for their application in TB care is scarce. Evaluating LLMs for TB decision support is crucial to explore their potential to improve clinical accuracy, efficiency, and quality of care in high-burden, resource-limited settings.

Objective:

To evaluate whether large language models (LLMs) can assist tuberculosis (TB) physicians in clinical decision-making across triage, differential diagnosis, and management recommendation tasks, addressing potential delays and inequities in TB care.

Methods:

In this experimental comparative study conducted in 2025 under STARD guidelines, 17 standardized TB cases (7 simulated, 10 real) were assessed. Responses were generated by two advanced LLMs (ChatGPT-4o and DeepSeek-R1) and two TB physicians. Reference standards were established by three TB specialists. Objective performance was measured using precision, recall, and F1 scores. Subjective evaluation assessed suitability, information quality, and, for management tasks, safety, conciseness, understandability, and operability using 5-point Likert scales. Readability was measured by a Chinese R-value; group differences were analyzed using Mann-Whitney U tests.
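As a rough illustration of the objective metrics, the sketch below scores one response against a reference standard by treating both as sets of discrete items. The abstract does not describe how responses were matched to the reference standard, so the set-based matching, the function name `precision_recall_f1`, and the example items are all assumptions made for illustration, not the authors' procedure.

```python
def precision_recall_f1(predicted, reference):
    """Score one response against a reference, treating both as sets of items.

    Assumption: an item counts as correct only if it appears verbatim in the
    reference set. The study's actual matching rules are not stated.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)

    # Precision: share of the response's items that are in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the reference's items that the response covered.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical items, not taken from the study.
reference = {"sputum smear", "chest radiograph",
             "drug susceptibility testing", "HIV test"}
response = {"sputum smear", "chest radiograph", "bronchoscopy"}

p, r, f1 = precision_recall_f1(response, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this definition, a short response can reach high precision while still missing most reference items, which is the pattern that low recall with similar precision would reflect.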

Results:

Across all tasks, LLMs achieved precision similar to that of physicians (median 0.67 vs 0.50; U = 8695.5; P = .35) but higher recall (0.53 vs 0.33; U = 6848.5; P < .001) and F1 scores (0.58 vs 0.33; U = 7085.5; P < .001). In management recommendation tasks, LLMs outperformed physicians in recall (0.50 vs 0.20; U = 185.0; P < .001) and F1 (0.50 vs 0.30; U = 104.0; P < .001), with no difference in precision. In subjective evaluation, LLMs scored higher in suitability (3.67 vs 3.00; U = 1122.0; P < .001), information quality (3.33 vs 2.67; U = 155.0; P < .001), understandability (3.67 vs 3.00; U = 4281.5; P = .022), and operability (3.67 vs 3.00; U = 4305.0; P = .025). No differences were observed in conciseness (P = .54) or safety (P = .06). Physicians’ responses were more readable (1.88 vs 2.17; U = 11427.5; P < .001).
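For readers unfamiliar with the U statistic reported above, the following is a minimal sketch of its pairwise definition. The function name `mann_whitney_u` and the sample values are hypothetical, and the abstract does not state which of the two possible U values (or which software) the authors used, so this illustrates the statistic only and does not reproduce their analysis.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return the Mann-Whitney U statistic for sample_a.

    Every (a, b) pair contributes 1 if a > b and 0.5 if a == b.
    Ties are why U can take half-integer values.
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical per-case recall scores, not taken from the study.
llm_recall = [0.5, 0.6, 0.4, 0.5]
physician_recall = [0.2, 0.33, 0.5]

u_llm = mann_whitney_u(llm_recall, physician_recall)
u_physician = mann_whitney_u(physician_recall, llm_recall)

# The two U values always sum to the number of pairs.
print(u_llm, u_physician, len(llm_recall) * len(physician_recall))
```

The P value is then derived from U using either exact tables or a normal approximation; that step is omitted here.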

Conclusions:

LLMs can serve as adjuncts to support TB clinical decision-making, enhancing management recommendations without replacing physicians. Their use may improve decision efficiency and help reduce disparities in TB care.

Clinical Trial:

This experimental comparative study evaluating large language models versus tuberculosis physicians did not involve patient interventions or randomization, and therefore was not registered as a clinical trial.

Citation

Please cite as:

Liao J, He W, Pan H, Zhang L, Li X, Huang J, Liu Z, Ke X, Li J, Li X, Hwang C, Cai H, Li G, Chang J

Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study

JMIR Preprints. 10/10/2025:85613

DOI: 10.2196/preprints.85613

URL: https://preprints.jmir.org/preprint/85613

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: Transfer Hub (manuscript eXchange)

Date Submitted: Oct 10, 2025

Open Peer Review Period: Oct 13, 2025 - Dec 8, 2025

NOTE: This is an unreviewed Preprint
