
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 6, 2024
Date Accepted: Dec 25, 2024

The final, peer-reviewed published version of this preprint can be found here:

Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset

Fukushima T, Manabe M, Yada S, Wakamiya S, Yoshida A, Urakawa Y, Maeda A, Kan S, Takahashi M, Aramaki E

JMIR Med Inform 2025;13:e65047

DOI: 10.2196/65047

PMID: 39819819

PMCID: 11783024

Evaluating and Enhancing Japanese LLMs for Genetic Counseling Support: A Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset

  • Takuya Fukushima; 
  • Masae Manabe; 
  • Shuntaro Yada; 
  • Shoko Wakamiya; 
  • Akiko Yoshida; 
  • Yusaku Urakawa; 
  • Akiko Maeda; 
  • Shigeyuki Kan; 
  • Masayo Takahashi; 
  • Eiji Aramaki

ABSTRACT

Background:

The field of genetics has advanced significantly, revealing strong links between genetics and health. Consequently, the demand for genetic counseling services has increased, while the shortage of genetic counseling professionals has become a significant challenge. The emergence of large language models (LLMs) in recent years offers a potential solution to this issue. However, the current capabilities and limitations of Japanese LLMs for genetic counseling require further investigation. Additionally, to develop a dialogue system that supports genetic counseling, domain adaptation methods for LLMs should be explored, and expert evaluations should be collected to assess the quality of LLM responses.

Objective:

This study aims to evaluate the current capabilities of, and identify obstacles in developing, an LLM-based dialogue system for genetic counseling. The primary focus is to assess the effectiveness of domain adaptation methods in the context of genetic counseling. Furthermore, we construct a dataset in which experts evaluate responses generated by LLMs adapted with various domain adaptation methods, gathering expert feedback for the future development of genetic counseling LLMs.

Methods:

Our study used two main datasets: (1) a question-answering (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset comprised 899 pairs covering topics in medicine and genetic counseling, whereas the evaluation dataset comprised 120 refined questions across six genetic counseling categories. Three domain adaptation methods (instruction tuning, retrieval-augmented generation [RAG], and prompt engineering) were applied to a lightweight Japanese LLM, and the adapted models were evaluated on the 120 questions. Two certified genetic counselors and one ophthalmologist assessed the LLM-generated responses on four key metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus.
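To make the RAG setup concrete, the sketch below illustrates the general idea of retrieving a stored QA pair and prepending it to the prompt. This is a hypothetical, simplified illustration: the paper's actual retriever, prompt template, and LLM are not specified here, and the toy word-overlap retrieval and example QA pairs are assumptions for demonstration only.

```python
# Hypothetical sketch of a RAG-style prompt builder: retrieve the stored QA
# pair most similar to the incoming question (here, by simple word overlap)
# and prepend it to the prompt sent to the LLM.

def word_overlap(a: str, b: str) -> int:
    # Count distinct words shared between two strings (toy similarity measure).
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(question: str, qa_pairs: list[tuple[str, str]]) -> tuple[str, str]:
    # Pick the stored QA pair whose question shares the most words with the query.
    return max(qa_pairs, key=lambda qa: word_overlap(question, qa[0]))

def build_prompt(question: str, qa_pairs: list[tuple[str, str]]) -> str:
    # Assemble the final prompt: retrieved reference first, then the new question.
    ref_q, ref_a = retrieve(question, qa_pairs)
    return (
        "Reference Q: " + ref_q + "\n"
        "Reference A: " + ref_a + "\n"
        "Patient question: " + question + "\n"
        "Answer:"
    )

# Example (illustrative) QA store and query.
qa_pairs = [
    ("What is genetic counseling?",
     "A process that helps people understand genetic conditions."),
    ("Is this condition hereditary?",
     "Heritability depends on the specific condition."),
]
prompt = build_prompt("Can you explain genetic counseling?", qa_pairs)
```

In a real system, the word-overlap retriever would be replaced by a dense or lexical retriever over the 899-pair QA dataset, and the assembled prompt would be passed to the Japanese LLM.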

Results:

The evaluation by the certified genetic counselors and the ophthalmologist revealed varied outcomes across the domain adaptation methods. RAG demonstrated promising results, particularly in enhancing key aspects of genetic counseling, whereas instruction tuning and prompt engineering yielded less favorable outcomes. This evaluation process produced a dataset of expert-evaluated responses generated by LLMs adapted using various combinations of these methods. Error analysis highlighted critical ethical concerns, such as inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements.

Conclusions:

RAG significantly improved performance on all evaluation criteria, with potential for further gains through expansion of the RAG data. Our expert-evaluated dataset offers valuable insights for future development. However, the ethical issues identified in LLM responses underscore the importance of continued refinement and careful ethical consideration before these systems are implemented in healthcare settings.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.