Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 10, 2025
Date Accepted: Apr 16, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Extraction of Mortality Information From Publicly Available Sources Using Large Language Models: Development and Evaluation Study

Al-Garadi M, LeNoue-Newton M, Matheny M, McPheeters M, Whitaker J, Deere J, Westerman D, Khan M, Hernández-Muñoz JJ, Wang X, Kuzucan A, Desai R, Reeves R

J Med Internet Res 2025;27:e71113

DOI: 10.2196/71113

PMID: 40824124

PMCID: 12359966

Automated Extraction of Mortality Information from Publicly Available Sources Using Language Models: Large Language Model–Based Study

  • Mohammed Al-Garadi; 
  • Michele LeNoue-Newton; 
  • Michael Matheny; 
  • Melissa McPheeters; 
  • Jill Whitaker; 
  • Jessica Deere; 
  • Dax Westerman; 
  • Mirza Khan; 
  • José J. Hernández-Muñoz; 
  • Xi Wang; 
  • Aida Kuzucan; 
  • Rishi Desai; 
  • Ruth Reeves

ABSTRACT

Background:

Mortality is a critical variable in healthcare research, but inconsistencies in the availability of death date and cause of death (CoD) information limit the ability to monitor medical product safety and effectiveness.

Objective:

To develop scalable approaches using natural language processing (NLP) and large language models (LLMs) to extract mortality information from publicly available online data sources, including social media platforms, crowdfunding websites, and online obituaries.

Methods:

Data were collected from public posts on X (formerly Twitter), GoFundMe campaigns, memorial websites (EverLoved.com and TributeArchive.com), and online obituaries from 2015 to 2022. We developed an NLP pipeline using transformer-based models to extract key mortality information such as decedent names, dates of birth, and dates of death. We then employed a few-shot learning (FSL) approach with LLMs to identify primary and secondary causes of death. Model performance was assessed using precision, recall, F1-score, and accuracy. Human-annotated labels served as the reference standard for the transformer-based models; for the FSL approach, the reference standard was set by a human adjudicator blinded to the labeling source.
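
As an illustration of the evaluation metrics named above, the following is a minimal sketch of micro-averaged precision, recall, and F1-score for entity extraction. It assumes exact-match scoring of (entity type, text) pairs per document; the abstract does not state the matching criterion, and the function name, entity labels, and sample records are hypothetical.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over a set of documents.

    gold, pred: lists of sets of (entity_type, text) tuples, one set per
    document. Exact-match scoring is an assumption made for this sketch.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # extracted and present in the reference
        fp += len(p - g)   # extracted but absent from the reference
        fn += len(g - p)   # in the reference but not extracted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reference annotations and model output for two documents.
gold = [
    {("NAME", "Jane Doe"), ("DOD", "2021-03-04")},
    {("NAME", "John Roe"), ("DOB", "1950-01-01"), ("DOD", "2020-05-06")},
]
pred = [
    {("NAME", "Jane Doe"), ("DOD", "2021-03-05")},   # wrong date of death
    {("NAME", "John Roe"), ("DOB", "1950-01-01")},   # missed date of death
]

precision, recall, f1 = micro_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Micro-averaging pools true positives, false positives, and false negatives across all entity types before computing the ratios, so frequent entity types weigh more heavily than rare ones.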

Results:

The best-performing model obtained a micro-averaged F1-score of 0.88 (95% CI, 0.86-0.90) in extracting mortality information. The FSL-LLM approach demonstrated high accuracy in identifying primary CoD across online sources. For GoFundMe, the FSL-LLM achieved 95.9% accuracy for primary cause identification, compared with 97.9% for human annotators. For obituaries, FSL-LLM accuracy was 96.5% for primary causes, compared with 99.0% for human annotators. For memorial websites, FSL-LLM accuracy was 98.0% for primary causes, compared with 99.5% for human annotators.

Conclusions:

These findings highlight the potential of leveraging advanced NLP techniques and publicly available data to enhance the timeliness, comprehensiveness, and granularity of mortality surveillance.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.