Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 4, 2024
Date Accepted: Sep 15, 2024
Use of SNOMED CT in Large Language Models: A Scoping Review
ABSTRACT
Background:
SNOMED CT is a widely adopted standardized terminology in electronic health records and common data models, and it has attracted attention for its secondary use as a biomedical knowledge source. Large language models (LLMs) commonly face "hallucination" challenges, and integrating SNOMED CT as a knowledge base with LLMs has been proposed to improve natural language understanding and generation in the biomedical domain.
Objective:
We aimed to review the state-of-the-art methodologies for incorporating SNOMED CT into LLMs to enhance biomedical natural language understanding and generation tasks.
Methods:
A comprehensive review of SNOMED CT integration in language models was conducted by querying ACM Digital Library, ACL Anthology, IEEE Xplore, PubMed, and Embase for publications between 2018 and 2023. Thirty-seven papers were selected for the final review.
Results:
BERT and its fine-tuned variants were the mainstream baseline language models in the examined literature. The majority of studies (n=28) incorporated SNOMED CT content, such as descriptions, relations, and entity types (classes), into the inputs of large language models or training corpora. Other approaches included incorporating SNOMED CT into additional fusion modules of language models or retrieving knowledge from SNOMED CT for inference. SNOMED CT-integrated large language models prevailed in natural language understanding tasks (n=30) such as entity typing, classification, and, most notably, medical concept normalization. The integrated models also encompassed natural language generation tasks (n=9), such as translation, summarization, and question answering. However, only a small number of studies reported performance differences before and after the SNOMED CT integration.
Conclusions:
As the use of SNOMED CT as a reliable knowledge source becomes more feasible, SNOMED CT-integrated language models have the potential to ensure model accountability and to advance natural language understanding and generation for downstream tasks in the biomedical domain. Future research is expected to pay closer attention to the advantages of incorporating SNOMED CT into large language models.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.