JMIR Preprints #63881: InfectA-Chat: Large Language Model for Infectious Diseases in Arabic Language

InfectA-Chat: Large Language Model for Infectious Diseases in Arabic Language

Insung Ahn;
Eunhui Kim;
Yesim Selcuk

ABSTRACT

Background:

Infectious diseases have consistently been a significant concern in public health, requiring proactive measures to safeguard societal well-being. In this regard, regular monitoring plays a crucial role in mitigating the adverse effects of diseases on society. To monitor disease trends, various platforms, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), collect diverse surveillance data and make it publicly accessible. However, these platforms present surveillance data primarily in English, which creates a language barrier that hinders non-English-speaking individuals and global public health efforts from accurately observing disease trends. This challenge is particularly noticeable in regions such as the Middle East, where specific infectious diseases such as MERS-CoV have seen dramatic increases. For such regions, it is essential to develop tools that can overcome language barriers and reach more individuals to alleviate the negative impacts of these diseases.

Objective:

To address these issues, we propose InfectA-Chat, a large language model that is designed specifically for Arabic but also incorporates English for enhanced information relevance. InfectA-Chat draws on its understanding of both languages to answer user queries with information on the latest trends in infectious diseases.

Methods:

We instruction-tuned the AceGPT-7B and AceGPT-7B-Chat models on a question-answering task using a dataset of 55,400 Arabic and English domain-specific instruction-following examples. The performance of the fine-tuned models was evaluated on 2,770 domain-specific Arabic and English instruction-following examples using the GPT-4 evaluation method. A comparative analysis was then performed against Arabic LLMs and state-of-the-art models, including AceGPT-13B-Chat, Jais-13B-Chat, Gemini, GPT-3.5, and GPT-4. Furthermore, we employed the Retrieval-Augmented Generation (RAG) method so that the model can access the latest information on infectious diseases through regular data updates, without additional fine-tuning.
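As a rough illustration of the retrieval step in RAG, the sketch below ranks a small document store against a query and builds a prompt from the top-k matches. It is not the authors' implementation: the abstract does not specify the retriever, so a toy bag-of-words cosine similarity stands in for a real embedding model, and the function names (`embed`, `retrieve`, `build_prompt`) and the sample documents are hypothetical placeholders rather than real surveillance data.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical document store; updating this list is what keeps answers
# current without fine-tuning the model again.
docs = [
    "MERS-CoV cases reported in the region this week",
    "Seasonal influenza activity remains low",
    "Cholera outbreak response update",
]

print(build_prompt("MERS-CoV cases", docs, top_k=1))
```

Because the documents are looked up at query time, refreshing the store changes what the model sees without touching its weights, which is the property the Methods rely on.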

Results:

InfectA-Chat performed strongly in answering questions about infectious diseases under the GPT-4 evaluation method. Our comparative analysis showed that it outperformed the AceGPT-7B-Chat model and the InfectA-Chat variant based on AceGPT-7B by a margin of 48.5%. It also surpassed other Arabic LLMs, such as AceGPT-13B-Chat and Jais-13B-Chat, by 52.3%. Among state-of-the-art models, InfectA-Chat achieved a leading performance of 27.2%, competing closely with GPT-4. Furthermore, the Retrieval-Augmented Generation (RAG) method within InfectA-Chat significantly improved document retrieval accuracy. Notably, RAG retrieved more accurate documents for a given query as the top-k parameter value increased.
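One common way to quantify how retrieval accuracy depends on top-k is the hit rate at k: the fraction of queries whose relevant document appears among the first k results. The abstract does not state which retrieval metric was used, so the sketch below is an assumption for illustration only; `hit_rate_at_k`, the rankings, and the document IDs are all hypothetical.

```python
def hit_rate_at_k(ranked_lists, relevant_ids, k):
    """Fraction of queries whose relevant document is in the top k results."""
    hits = sum(
        1 for ranked, rel in zip(ranked_lists, relevant_ids) if rel in ranked[:k]
    )
    return hits / len(relevant_ids)


# Hypothetical retriever output: one ranked list of document IDs per query,
# and the single relevant document ID for each query.
rankings = [
    ["d1", "d4", "d2"],
    ["d3", "d2", "d5"],
    ["d7", "d6", "d5"],
]
relevant = ["d1", "d2", "d5"]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(rankings, relevant, k))
```

With a fixed ranking, this metric can only stay the same or rise as k grows, since a larger k never removes a document that was already retrieved. The trade-off is that a larger k also passes more, possibly irrelevant, context to the model.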

Conclusions:

Our findings highlight the shortcomings of general Arabic LLMs in providing up-to-date information about infectious diseases. With this study, we aim to empower individuals and public health efforts by offering a bilingual Q&A system for infectious disease monitoring.

Clinical Trial: None declared.

Citation

Please cite as:

Ahn I, Kim E, Selcuk Y

InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis

JMIR Med Inform 2025;13:e63881

DOI: 10.2196/63881

PMID: 39928922

PMCID: 11851044

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 2, 2024

Date Accepted: Dec 9, 2024
