JMIR Preprints #87276: Linguistic Fidelity and Classification Performance of Large Language Models for Generating Synthetic Operative Notes: Evaluation Study

Linguistic Fidelity and Classification Performance of Large Language Models for Generating Synthetic Operative Notes: Evaluation Study

Meredith Cox;
Elaine Lin;
Nicholas Oleck;
Carlee Jones;
Neill Y. Li;
Suhail K. Mithani;
Alexander C. Allori

ABSTRACT

Background:

Machine learning models for surgical applications require large, diverse datasets, yet data scarcity remains a critical limitation due to privacy regulations, institutional variability, and the rarity of many surgical procedures. Large language models (LLMs) offer a potential solution through synthetic data generation, but their performance and reliability in specialized surgical domains remain underexplored.

Objective:

To evaluate the linguistic fidelity of LLM-generated operative notes for cleft lip and palate procedures and to assess their impact on natural language processing (NLP) classifier performance under varying data availability conditions.

Methods:

We obtained 630 authentic operative notes from cleft procedures (86 cleft lip repairs, 101 cleft palate repairs, 62 alveolar bone grafting procedures) performed between 2013 and 2024. GPT-4 generated matched synthetic notes using multishot prompting with anonymized examples. Linguistic fidelity was evaluated using BERTScore for semantic similarity, Jensen–Shannon divergence of part-of-speech trigrams for syntactic structure, and BLEU scores for lexical overlap. Binary classifiers using ClinicalBERT embeddings and logistic regression were trained under both full-data and data-scarce (10% of real notes) conditions, with and without synthetic augmentation.
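As an illustration of the syntactic measure named above, the following is a minimal sketch of Jensen–Shannon divergence computed over part-of-speech trigram frequencies. It is not the authors' implementation. It assumes the notes have already been tagged (the tag sequences shown are hypothetical), and it uses base-2 logarithms so the value falls between 0 and 1; the abstract does not state which tagger, log base, or smoothing was used.

```python
from collections import Counter
from math import log2


def trigram_distribution(tag_sequences):
    """Relative frequencies of POS-tag trigrams pooled over a list of tag sequences."""
    counts = Counter()
    for tags in tag_sequences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two distributions given as dicts."""
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of both supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl_to_mixture(a):
        # KL(A || M); terms with zero probability in A contribute nothing.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical, pre-tagged sentences (tags are illustrative only).
real_tags = [["DET", "NOUN", "AUX", "VERB", "ADP", "DET", "NOUN"]]
synthetic_tags = [["DET", "NOUN", "AUX", "VERB", "ADV", "ADP", "NOUN"]]

jsd = jensen_shannon_divergence(
    trigram_distribution(real_tags),
    trigram_distribution(synthetic_tags),
)
print(f"JSD of POS trigrams: {jsd:.3f}")
```

Under this definition, identical trigram distributions give 0 and distributions with no shared trigrams give 1, so lower values mean the two sets of notes are syntactically closer. Some libraries return the square root of this quantity (the Jensen–Shannon distance), so the two should not be compared directly.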

Results:

Synthetic notes demonstrated high semantic fidelity across all procedures (BERTScore F1: 0.86–0.88) and low syntactic divergence (Jensen–Shannon divergence: 0.06–0.08). BLEU scores indicated moderate lexical variation (0.14–0.19), reflecting distinct but contextually consistent phrasing. With full datasets, synthetic augmentation did not affect classifier performance. Under data-scarce conditions (10% of authentic notes), synthetic augmentation improved AUC from 0.77 to 0.89 for cleft lip classification and from 0.84 to 0.89 for cleft palate, with smaller gains for alveolar bone grafting (0.92 to 0.94).

Conclusions:

LLM-generated operative notes exhibit strong semantic and syntactic fidelity to authentic documentation and can enhance model performance in a task-dependent manner when authentic data are limited. These findings suggest synthetic data generation may address data scarcity challenges in specialized surgical domains, particularly for rare or underrepresented procedures, while maintaining patient privacy and enabling robust machine learning model development.

Citation

Please cite as:

Cox M, Lin E, Oleck N, Jones C, Li NY, Mithani SK, Allori AC

Linguistic Fidelity and Classification Performance of Large Language Models for Generating Synthetic Operative Notes: Evaluation Study

JMIR Form Res 2026;10:e87276

DOI: 10.2196/87276

PMID: 42397952

PMCID: 13331248

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Formative Research

Date Submitted: Nov 6, 2025

Date Accepted: May 25, 2026
