Accepted for/Published in: JMIR Formative Research

Date Submitted: Nov 6, 2025
Date Accepted: May 25, 2026

The final, peer-reviewed published version of this preprint can be found here:

Linguistic Fidelity and Classification Performance of Large Language Models for Generating Synthetic Operative Notes: Evaluation Study

Cox M, Lin E, Oleck N, Jones C, Li NY, Mithani SK, Allori AC

JMIR Form Res 2026;10:e87276

DOI: 10.2196/87276

PMID: 42397952

Linguistic Fidelity and Classification Performance of Large Language Models for Generating Synthetic Operative Notes: Evaluation Study

  • Meredith Cox
  • Elaine Lin
  • Nicholas Oleck
  • Carlee Jones
  • Neill Y. Li
  • Suhail K. Mithani
  • Alexander C. Allori

ABSTRACT

Background:

Machine learning models for surgical applications require large, diverse datasets, yet data scarcity remains a critical limitation due to privacy regulations, institutional variability, and the rarity of many surgical procedures. Large language models (LLMs) offer a potential solution through synthetic data generation, but their performance and reliability in specialized surgical domains remain underexplored.

Objective:

To evaluate the linguistic fidelity of LLM-generated operative notes for cleft lip and palate procedures and to assess their impact on natural language processing (NLP) classifier performance under varying data availability conditions.

Methods:

We obtained 249 authentic operative notes from cleft procedures (86 cleft lip repairs, 101 cleft palate repairs, and 62 alveolar bone grafting procedures) performed between 2013 and 2024. GPT-4 generated matched synthetic notes using multishot prompting with anonymized examples. Linguistic fidelity was evaluated using BERTScore for semantic similarity, Jensen–Shannon divergence of part-of-speech trigrams for syntactic structure, and BLEU scores for lexical overlap. Binary classifiers using ClinicalBERT embeddings and logistic regression were trained under both full-data and data-scarce (10% of real notes) conditions, with and without synthetic augmentation.
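As an illustration of the syntactic measure named above, the following is a minimal sketch of a Jensen–Shannon divergence between part-of-speech trigram distributions. It is not the authors' code. It assumes the notes have already been tagged by some part-of-speech tagger (the tag sequences below are made up), and it assumes base-2 logarithms, which bound the divergence between 0 and 1; the abstract does not state the tagger, the log base, or whether the divergence or its square root was reported. The function names are hypothetical.

```python
import math
from collections import Counter


def pos_trigram_distribution(tagged_notes):
    """Relative frequency of POS-tag trigrams over a list of tag sequences."""
    counts = Counter()
    for tags in tagged_notes:
        # Slide a window of three consecutive tags over each note.
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs (assumed): 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of trigrams.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); trigrams with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Made-up tag sequences standing in for one real and one synthetic note.
real_tags = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
synthetic_tags = [["DET", "NOUN", "VERB", "ADJ", "NOUN"]]

p = pos_trigram_distribution(real_tags)
q = pos_trigram_distribution(synthetic_tags)
print(round(jensen_shannon_divergence(p, q), 3))
```

Lower values mean the two sets of notes use part-of-speech patterns in more similar proportions.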

Results:

Synthetic notes demonstrated high semantic fidelity across all procedures (BERTScore F1: 0.86–0.88) and low syntactic divergence (Jensen–Shannon divergence: 0.06–0.08). BLEU scores indicated moderate lexical variation (0.14–0.19), reflecting distinct but contextually consistent phrasing. With full datasets, synthetic augmentation did not affect classifier performance. Under data-scarce conditions (10% of authentic notes), synthetic augmentation improved the area under the curve (AUC) from 0.77 to 0.89 for cleft lip classification and from 0.84 to 0.89 for cleft palate classification, with a smaller gain for alveolar bone grafting (0.92 to 0.94).
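For readers less familiar with the classifier metric, here is a minimal sketch of how an AUC can be computed, assuming it refers to the area under the receiver operating characteristic curve. Under that assumption, the AUC equals the probability that a randomly chosen positive note receives a higher classifier score than a randomly chosen negative note, with ties counted as half. The function name, labels, and scores are illustrative only and are not taken from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = target procedure) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))
```

On this scale, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.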

Conclusions:

LLM-generated operative notes exhibit strong semantic and syntactic fidelity to authentic documentation and can enhance model performance in a task-dependent manner when authentic data are limited. These findings suggest that synthetic data generation may address data scarcity in specialized surgical domains, particularly for rare or underrepresented procedures, while maintaining patient privacy and enabling robust machine learning model development.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.