Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 6, 2024
Date Accepted: Dec 25, 2024
Evaluating and Enhancing Japanese LLMs for Genetic Counseling Support: A Comparative Study of Domain Adaptation and the Development of Expert-Evaluated Dataset
ABSTRACT
Background:
Advances in genetics have revealed a strong correlation between genetics and health, increasing the demand for genetic counseling services. However, the shortage of genetic counseling professionals poses a significant challenge. The emergence of large language models (LLMs) in recent years offers a potential solution, but the current capabilities and limitations of Japanese LLMs for genetic counseling require further investigation. Moreover, to develop a dialogue system that supports genetic counseling, domain adaptation methods for LLMs should be explored, and expert data should be collected to assess the quality of LLM responses.
Objective:
This study aims to evaluate the current capabilities of LLMs and identify obstacles in developing an LLM-based dialogue system for genetic counseling. The primary focus is to assess the effectiveness of domain adaptation methods within the context of genetic counseling. Furthermore, we establish a dataset in which experts evaluate responses generated by LLMs adapted with various domain adaptation methods, gathering expert feedback for the future development of genetic counseling LLMs.
Methods:
Our study used two main datasets: (1) a question-answering (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset comprised 899 pairs covering topics in medicine and genetic counseling, whereas the evaluation dataset comprised 120 refined questions across six genetic counseling categories. Three domain adaptation methods (instruction tuning, retrieval-augmented generation [RAG], and prompt engineering) were applied to a lightweight Japanese LLM. Two certified genetic counselors and one ophthalmologist assessed the adapted LLM's responses to the 120 evaluation questions on four key metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus.
Results:
The evaluation conducted by the certified genetic counselors and the ophthalmologist revealed varied outcomes across the domain adaptation methods. RAG demonstrated promising results, particularly in enhancing key aspects of genetic counseling, whereas instruction tuning and prompt engineering yielded less favorable outcomes. This evaluation process yielded a dataset of expert-evaluated responses generated by LLMs adapted with various combinations of these methods. Error analysis highlighted critical ethical concerns, such as inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements.
Conclusions:
RAG significantly improved performance on all evaluation criteria, with potential for further gains through expansion of the RAG data. Our expert-evaluated dataset offers valuable insights for future development. However, the ethical issues identified in LLM responses underscore the importance of continued refinement and careful ethical consideration before such systems are deployed in healthcare settings.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.