
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 15, 2025
Open Peer Review Period: May 15, 2025 - Jul 10, 2025
Date Accepted: Dec 24, 2025

The final, peer-reviewed published version of this preprint can be found here:

Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset

Shin HJ, Jeong YJ, Jun S, Kang DY

JMIR Med Inform 2026;14:e77561

DOI: 10.2196/77561

PMID: 41539675

PMCID: 12856398

Prompting and Fine-Tuning LLMs for Parkinson’s Disease Diagnosis: A Study Using the PPMI Structured Dataset

  • Hyun-Ji Shin
  • Young-Jin Jeong
  • Sungmin Jun
  • Do-Young Kang

ABSTRACT

Background:

Parkinson’s disease (PD) is a neurodegenerative disorder characterized by a wide spectrum of motor and nonmotor symptoms, making accurate diagnosis particularly challenging. While structured clinical data are increasingly available in the field, this study focused specifically on the Parkinson’s Progression Markers Initiative (PPMI) dataset to investigate the diagnostic capacity of large language models (LLMs) for PD.

Objective:

This study aimed to investigate whether LLMs can accurately classify PD using structured clinical data, and how different prompting and fine-tuning strategies influence diagnostic outcomes.

Methods:

We analyzed structured clinical data from 1,238 participants enrolled in PPMI, using 1,052 samples for training and 186 for testing. Clinical variables were converted into natural language prompts using four formats: plain text (PT), markdown (MD), special token (ST), and markdown with special token (MD+ST). We conducted three core experiments: (1) zero- to three-shot prompting across seven LLMs from four model families, (2) supervised fine-tuning of three lightweight models, and (3) dual-output prompting using four flagship models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and LLaMA 3.3 70B) to examine whether reasoning generation influenced diagnostic classification.
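As a rough illustration of how a structured clinical record might be serialized into the four prompt formats and assembled into a few-shot prompt, consider the following Python sketch. The field names (age, updrs_iii, datscan_sbr) and the exact prompt wording are hypothetical stand-ins; the abstract does not specify the actual PPMI variables or templates used.

```python
# Hypothetical sketch of the four prompt formats named in the Methods:
# plain text (PT), markdown (MD), special token (ST), and markdown with
# special tokens (MD+ST). Field names are illustrative, not the actual
# PPMI variables used in the study.

record = {"age": 64, "updrs_iii": 28, "datscan_sbr": 1.4}

def to_pt(rec):
    # PT: variables joined as plain prose
    return ", ".join(f"{k} is {v}" for k, v in rec.items())

def to_md(rec):
    # MD: one markdown bullet per clinical variable
    return "\n".join(f"- **{k}**: {v}" for k, v in rec.items())

def to_st(rec):
    # ST: each variable wrapped in special boundary tokens
    return " ".join(f"<{k}> {v} </{k}>" for k, v in rec.items())

def to_md_st(rec):
    # MD+ST: markdown bullets combined with special tokens
    return "\n".join(f"- <{k}> {v} </{k}>" for k, v in rec.items())

def build_fewshot(formatter, examples, query, k=3):
    # Assemble a k-shot prompt: k labeled examples followed by the query.
    shots = "\n\n".join(
        f"{formatter(rec)}\nDiagnosis: {label}" for rec, label in examples[:k]
    )
    return f"{shots}\n\n{formatter(query)}\nDiagnosis:"

query = {"age": 58, "updrs_iii": 4, "datscan_sbr": 2.1}
print(build_fewshot(to_md, [(record, "PD")], query, k=1))
```

Supervised fine-tuning of the lightweight models would pair these serialized records with gold diagnostic labels as input-output examples; the same formatters can generate that training data.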

Results:

Among the seven LLMs evaluated across the four prompt formats and zero- to three-shot settings, the highest F1-score (0.995) was achieved by LLaMA 3.3 70B and Gemini 1.5 Pro under few-shot prompting, both of which demonstrated perfect recall and zero inconsistency. Fine-tuned lightweight models, including GPT-4o-mini and LLaMA 3.1 8B, achieved equivalent performance (F1 = 0.995), confirming the efficiency of supervised adaptation. Under dual-output prompting, which requires models to generate both diagnostic labels and three-sentence inferences, slight declines in performance were observed (e.g., LLaMA 3.3 70B, F1 = 0.990; inconsistency = 2), suggesting that simultaneous reasoning generation may affect diagnostic reliability.
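The dual-output setting asks a model to return both a diagnostic label and a short rationale in a single response. Below is a minimal scoring sketch assuming a JSON response schema; both the schema and the inconsistency definition (here, responses whose label is missing or out of schema) are assumptions for illustration, not the study’s exact protocol.

```python
# Illustrative dual-output scoring sketch. Assumes each model response is
# JSON like {"label": "PD", "reasoning": "..."}; this schema and the
# inconsistency count below are assumptions, not the study's protocol.
import json
from sklearn.metrics import f1_score

def parse_response(text):
    try:
        out = json.loads(text)
        return out.get("label"), out.get("reasoning", "")
    except json.JSONDecodeError:
        return None, ""

def evaluate(responses, gold_labels):
    preds, inconsistent = [], 0
    for text in responses:
        label, _reasoning = parse_response(text)
        if label not in {"PD", "non-PD"}:
            inconsistent += 1      # unparseable or out-of-schema output
            label = "non-PD"       # fall back to a default class
        preds.append(label)
    f1 = f1_score(gold_labels, preds, pos_label="PD")
    return f1, inconsistent
```

Under this kind of scheme, a model that occasionally drifts into free-form reasoning and breaks the output schema accrues inconsistency even when its underlying judgment is correct, which is one plausible mechanism for the slight F1 decline reported for dual-output prompting.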

Conclusions:

Few-shot prompting of flagship models and supervised fine-tuning of lightweight models both classified PD from structured PPMI data with near-perfect accuracy (F1 = 0.995), whereas requiring simultaneous reasoning generation slightly reduced diagnostic reliability. These findings suggest that, with appropriate prompting or lightweight fine-tuning, LLMs can serve as accurate diagnostic classifiers on structured clinical data.


Citation

Please cite as:

Shin HJ, Jeong YJ, Jun S, Kang DY

Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset

JMIR Med Inform 2026;14:e77561

DOI: 10.2196/77561

PMID: 41539675

PMCID: 12856398


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.