Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 10, 2024
Date Accepted: Aug 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Generative Models and Sentence Transformers for the Recognition and Normalization of Continuous and Discontinuous Phenotype Mentions: Model Development and Evaluation

Alhassan A, Schlegel V, Aloud M, Batista-Navarro R, Nenadic G

JMIR Med Inform 2025;13:e68558

DOI: 10.2196/68558

PMID: 41191926

PMCID: 12631088

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

DiscHPO: Generative Models and Sentence Transformers for the Recognition and Normalisation of Continuous and Discontinuous Phenotype Mentions

  • Areej Alhassan
  • Viktor Schlegel
  • Monira Aloud
  • Riza Batista-Navarro
  • Goran Nenadic

ABSTRACT

Background:

Extracting genetic phenotype mentions from clinical reports and normalising them to standardised concepts in the Human Phenotype Ontology (HPO) are essential for consistent interpretation and representation of genetic conditions. This is particularly important in fields such as dysmorphology and plays a key role in advancing personalised healthcare. However, modern clinical Named Entity Recognition (NER) methods struggle to accurately identify discontinuous mentions (i.e., entity spans that are interrupted by unrelated words), which occur in these clinical reports.
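To make the notion of a discontinuous mention concrete, the following is a minimal sketch of one possible representation: a mention stored as an ordered tuple of character-offset fragments. The `Mention` class, the `locate` helper, and the example sentence are illustrative assumptions and are not taken from the paper or the shared task data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical container: a mention as ordered (start, end) character fragments."""
    fragments: Tuple[Tuple[int, int], ...]

    @property
    def is_discontinuous(self) -> bool:
        # More than one fragment means other words interrupt the mention.
        return len(self.fragments) > 1

    def surface(self, text: str) -> str:
        # Join the fragments, skipping the interrupting words.
        return " ".join(text[start:end] for start, end in self.fragments)


def locate(text: str, word: str) -> Tuple[int, int]:
    """Return the (start, end) offsets of the first occurrence of `word`."""
    start = text.index(word)
    return (start, start + len(word))


# Illustrative sentence, not taken from the shared task data.
text = "Fingers are long and slightly tapered."
continuous = Mention((locate(text, "tapered"),))
discontinuous = Mention((locate(text, "Fingers"), locate(text, "tapered")))

print(discontinuous.surface(text), discontinuous.is_discontinuous)
```

Here the words "are long and slightly" interrupt the second mention, so it needs two fragments, whereas the first fits in a single contiguous span.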

Objective:

This study aims to develop a system that accurately extracts and normalises genetic phenotypes from physical examination reports related to dysmorphology assessment. These mentions appear in both continuous and discontinuous lexical forms, and the study focuses on the more challenging discontinuous (disjoint) entity spans.

Methods:

We introduce DiscHPO, a two-phase pipeline consisting of (1) a sequence-to-sequence NER model for span extraction, and (2) an entity normaliser that employs a Sentence Transformer bi-encoder for candidate generation and a cross-encoder re-ranker for selecting the best candidate as the normalised concept. This system was tested as part of our participation in Track 3 of the BioCreative VIII shared task.
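The normalisation phase follows a retrieve-then-rerank pattern, which can be sketched as below. This is not the DiscHPO implementation: to keep the example runnable without model downloads, a character-trigram bag stands in for the Sentence Transformer bi-encoder and a stemmed token overlap stands in for the cross-encoder. All function names, the concept identifiers `C1` to `C4`, and the toy ontology are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: each string is encoded independently."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def generate_candidates(mention: str, ontology: Dict[str, str], k: int = 2) -> List[str]:
    """Phase 2a: rank all concepts by embedding similarity and keep the top k."""
    query = embed(mention)
    ranked = sorted(ontology, key=lambda cid: cosine(query, embed(ontology[cid])), reverse=True)
    return ranked[:k]


def pair_score(mention: str, label: str) -> float:
    """Stand-in for a cross-encoder: scores the (mention, label) pair jointly."""
    def stems(s: str) -> Set[str]:
        # Crude plural stripping, only for this toy example.
        return {token.rstrip("s") for token in s.lower().split()}

    a, b = stems(mention), stems(label)
    return len(a & b) / len(a | b) if a | b else 0.0


def normalise(mention: str, ontology: Dict[str, str], k: int = 2) -> str:
    """Phase 2b: rerank the shortlisted candidates and return the best concept id."""
    candidates = generate_candidates(mention, ontology, k)
    return max(candidates, key=lambda cid: pair_score(mention, ontology[cid]))


# Toy ontology with placeholder identifiers (not real HPO ids).
toy_ontology = {
    "C1": "Tapered finger",
    "C2": "Long fingers",
    "C3": "Short nose",
    "C4": "Tapered toe",
}
best = normalise("tapered fingers", toy_ontology)
print(best, toy_ontology[best])
```

The design rationale is the usual one for this pattern: the cheap independent encoder narrows the full ontology to a short list, so the more expensive pairwise scorer only has to compare the mention against a handful of candidates.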

Results:

For overall performance on the test set, the top-performing model for entity normalisation achieved an F1 score of 0.7229, while the best span extraction model reached an F1 score of 0.6647. Both scores surpassed those of two baseline models using the same dataset, indicating superior efficacy in handling both continuous and discontinuous spans. Approximately 14% of entity mentions in the dataset are disjoint spans. On the validation set, we were able to demonstrate our system's ability to recognise these mentions, with the model achieving an F1 score of 0.6235 for exact match on discontinuous spans only.
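For readers unfamiliar with exact-match span scoring, the sketch below shows how precision, recall, and F1 could be computed over sets of spans, including a filter that keeps discontinuous spans only. It is an assumption about the general form of the metric, not the official shared task scorer, and the gold and predicted spans are made-up values.

```python
from typing import Set, Tuple

# A span is (document id, fragments); fragments are (start, end) character offsets.
Span = Tuple[str, Tuple[Tuple[int, int], ...]]


def exact_match_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where a prediction counts only if every offset matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def only_discontinuous(spans: Set[Span]) -> Set[Span]:
    """Keep spans that consist of more than one fragment."""
    return {span for span in spans if len(span[1]) > 1}


# Made-up annotations for illustration.
gold = {("doc1", ((0, 7), (30, 37))), ("doc1", ((12, 16),)), ("doc2", ((5, 14),))}
predicted = {("doc1", ((0, 7), (30, 37))), ("doc1", ((12, 16),)), ("doc2", ((5, 10),))}

precision, recall, f1 = exact_match_prf(gold, predicted)
disc_f1 = exact_match_prf(only_discontinuous(gold), only_discontinuous(predicted))[2]
print(round(f1, 4), round(disc_f1, 4))
```

In this toy case the prediction for `doc2` has the wrong end offset, so it counts as both a false positive and a false negative under exact matching.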

Conclusions:

The findings suggest that exact extraction of entity spans may not always be necessary for successful normalisation. Partial mention matches can be sufficient as long as they capture the essential concept information, supporting the system’s utility in clinical downstream tasks.
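One hypothetical way to quantify how partial a match is, offered only as an illustration and not as the measure used in the study, is the character-level overlap between the gold and predicted fragments:

```python
from typing import Set, Tuple

Fragments = Tuple[Tuple[int, int], ...]


def char_positions(fragments: Fragments) -> Set[int]:
    """Expand (start, end) fragments into the set of covered character indices."""
    return {i for start, end in fragments for i in range(start, end)}


def overlap_ratio(gold: Fragments, predicted: Fragments) -> float:
    """Jaccard overlap of covered characters: 1.0 is exact, 0.0 is disjoint."""
    gold_chars = char_positions(gold)
    predicted_chars = char_positions(predicted)
    union = gold_chars | predicted_chars
    return len(gold_chars & predicted_chars) / len(union) if union else 0.0


# Made-up offsets: the prediction recovers only the second of two gold fragments.
gold_fragments = ((0, 7), (30, 37))
partial_prediction = ((30, 37),)
print(overlap_ratio(gold_fragments, partial_prediction))
```

A prediction with partial overlap like this would fail exact-match scoring but might still carry enough of the concept for the normaliser to link it correctly, which is the situation the conclusion describes.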


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.