
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 15, 2025
Open Peer Review Period: May 15, 2025 - Jul 10, 2025
Date Accepted: Dec 24, 2025

The final, peer-reviewed published version of this preprint can be found here:

Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset

Shin HJ, Jeong YJ, Jun S, Kang DY

JMIR Med Inform 2026;14:e77561

DOI: 10.2196/77561

PMID: 41539675

PMCID: 12856398

Prompting and Fine-Tuning LLMs for Parkinson’s Disease Diagnosis: A Study Using the PPMI Structured Dataset

  • Hyun-Ji Shin
  • Young-Jin Jeong
  • Sungmin Jun
  • Do-Young Kang

ABSTRACT

Background:

Parkinson’s disease (PD) is a neurodegenerative disorder characterized by a wide spectrum of motor and nonmotor symptoms, making accurate diagnosis particularly challenging. While structured clinical data are increasingly available in the field, this study focused specifically on the Parkinson’s Progression Markers Initiative (PPMI) dataset to investigate the diagnostic capacity of large language models (LLMs) for PD.

Objective:

This study aimed to investigate whether LLMs can accurately classify PD using structured clinical data, and how different prompting and fine-tuning strategies influence diagnostic outcomes.

Methods:

We analyzed structured clinical data from 1,238 participants enrolled in PPMI, using 1,052 samples for training and 186 for testing. Clinical variables were converted into natural language prompts using four formats: plain text (PT), markdown (MD), special token (ST), and markdown with special token (MD+ST). We conducted three core experiments: (1) zero- to three-shot prompting across seven LLMs from four model families, (2) supervised fine-tuning of three lightweight models, and (3) dual-output prompting using four flagship models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and LLaMA 3.3 70B) to examine whether reasoning generation influenced diagnostic classification.
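As a rough illustration of how a structured clinical record might be serialized into the four prompt formats and assembled into a few-shot prompt, consider the following Python sketch. The field names (age, updrs_iii, datscan_sbr) and the exact prompt wording are hypothetical stand-ins; the abstract does not specify the actual PPMI variables or templates used.

```python
# Hypothetical sketch of the four prompt formats named in the Methods:
# plain text (PT), markdown (MD), special token (ST), and markdown with
# special tokens (MD+ST). Field names are illustrative, not the actual
# PPMI variables used in the study.

record = {"age": 64, "updrs_iii": 28, "datscan_sbr": 1.4}

def to_pt(rec):
    # PT: variables joined as plain prose
    return ", ".join(f"{k} is {v}" for k, v in rec.items())

def to_md(rec):
    # MD: one markdown bullet per clinical variable
    return "\n".join(f"- **{k}**: {v}" for k, v in rec.items())

def to_st(rec):
    # ST: each variable wrapped in special boundary tokens
    return " ".join(f"<{k}> {v} </{k}>" for k, v in rec.items())

def to_md_st(rec):
    # MD+ST: markdown bullets combined with special tokens
    return "\n".join(f"- <{k}> {v} </{k}>" for k, v in rec.items())

def build_fewshot(formatter, examples, query, k=3):
    # Assemble a k-shot prompt: k labeled examples followed by the query.
    shots = "\n\n".join(
        f"{formatter(rec)}\nDiagnosis: {label}" for rec, label in examples[:k]
    )
    return f"{shots}\n\n{formatter(query)}\nDiagnosis:"

query = {"age": 58, "updrs_iii": 4, "datscan_sbr": 2.1}
print(build_fewshot(to_md, [(record, "PD")], query, k=1))
```

Supervised fine-tuning of the lightweight models would pair these serialized records with gold diagnostic labels as input-output examples; the same formatters can generate that training data.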

Results:

Among the seven LLMs evaluated across the four prompt formats and zero- to three-shot settings, the highest F1-score (0.995) was achieved by LLaMA 3.3 70B and Gemini 1.5 Pro under few-shot prompting, both of which demonstrated perfect recall and zero inconsistency. Fine-tuned lightweight models, including GPT-4o-mini and LLaMA 3.1 8B, achieved equivalent performance (F1 = 0.995), confirming the efficiency of supervised adaptation. Under dual-output prompting, which requires models to generate both diagnostic labels and three-sentence inferences, slight declines in performance were observed (e.g., LLaMA 3.3 70B, F1 = 0.990; inconsistency = 2), suggesting that simultaneous reasoning generation may affect diagnostic reliability.
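The dual-output setting asks a model to return both a diagnostic label and a short rationale in a single response. Below is a minimal scoring sketch assuming a JSON response schema; both the schema and the inconsistency definition (here, responses whose label is missing or out of schema) are assumptions for illustration, not the study’s exact protocol.

```python
# Illustrative dual-output scoring sketch. Assumes each model response is
# JSON like {"label": "PD", "reasoning": "..."}; this schema and the
# inconsistency count below are assumptions, not the study's protocol.
import json
from sklearn.metrics import f1_score

def parse_response(text):
    try:
        out = json.loads(text)
        return out.get("label"), out.get("reasoning", "")
    except json.JSONDecodeError:
        return None, ""

def evaluate(responses, gold_labels):
    preds, inconsistent = [], 0
    for text in responses:
        label, _reasoning = parse_response(text)
        if label not in {"PD", "non-PD"}:
            inconsistent += 1      # unparseable or out-of-schema output
            label = "non-PD"       # fall back to a default class
        preds.append(label)
    f1 = f1_score(gold_labels, preds, pos_label="PD")
    return f1, inconsistent
```

Under this kind of scheme, a model that occasionally drifts into free-form reasoning and breaks the output schema accrues inconsistency even when its underlying judgment is correct, which is one plausible mechanism for the slight F1 decline reported for dual-output prompting.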

Conclusions:

Few-shot prompting of flagship models and supervised fine-tuning of lightweight models both classified PD from structured PPMI data with near-perfect accuracy (F1 = 0.995), whereas requiring simultaneous reasoning generation slightly reduced diagnostic reliability. These findings suggest that, with appropriate prompting or lightweight fine-tuning, LLMs can serve as accurate diagnostic classifiers on structured clinical data.


Citation

Please cite as:

Shin HJ, Jeong YJ, Jun S, Kang DY

Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset

JMIR Med Inform 2026;14:e77561

DOI: 10.2196/77561

PMID: 41539675

PMCID: 12856398


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.