Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 2, 2024
Date Accepted: Dec 9, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
InfectA-Chat: Large Language Model for Infectious Diseases in Arabic Language
ABSTRACT
Background:
Infectious diseases have long been a significant public health concern, requiring proactive measures to safeguard societal well-being. Regular monitoring plays a crucial role in mitigating the adverse effects of diseases on society. To monitor disease trends, various organizations, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), collect diverse surveillance data and make it publicly accessible. However, these organizations publish surveillance data primarily in English, which creates a language barrier for non-English-speaking individuals and hampers global public health efforts to track disease trends accurately. This challenge is particularly noticeable in regions such as the Middle East, where specific infectious diseases like MERS-CoV have seen dramatic increases. For such regions, it is essential to develop tools that overcome language barriers and reach more individuals, thereby alleviating the negative impacts of these diseases.
Objective:
To address these issues, we propose InfectA-Chat, a large language model designed specifically for Arabic that also incorporates English for enhanced information relevance. InfectA-Chat leverages its language understanding to answer user queries with information on the latest trends in infectious diseases.
Methods:
We instruction-tuned the AceGPT-7B and AceGPT-7B-Chat models on a question-answering task, using a dataset of 55,400 domain-specific instruction-following examples in Arabic and English. The fine-tuned models were evaluated on 2,770 domain-specific Arabic and English instruction-following examples using the GPT-4 evaluation method. A comparative analysis was then performed against Arabic LLMs and state-of-the-art models, including AceGPT-13B-Chat, Jais-13B-Chat, Gemini, GPT-3.5, and GPT-4. Furthermore, we employed the Retrieval-Augmented Generation (RAG) method so that the model can access the latest information on infectious diseases through regular data updates, without additional fine-tuning.
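The retrieval step of RAG can be illustrated with a minimal sketch. The abstract does not specify the retriever, embedding model, or prompt template, so everything below is an assumption for illustration only: the toy corpus, the bag-of-words cosine similarity (standing in for a real embedding model), and the hypothetical helper names `retrieve` and `build_prompt`.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the study's actual documents are not reproduced here.
DOCUMENTS = [
    "MERS-CoV cases reported in the Middle East this month",
    "Seasonal influenza activity in Europe according to ECDC",
    "WHO update on cholera outbreaks and vaccination campaigns",
]

def tokenize(text):
    # \w+ matches Unicode word characters, so Arabic text is split as well.
    return re.findall(r"\w+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counter objects).
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, top_k=1):
    # Score every document against the query and keep the top_k best.
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_prompt(query, documents, top_k=1):
    # Assumed prompt layout: retrieved context first, then the question.
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("MERS-CoV cases", DOCUMENTS, top_k=2))
```

Because the documents are looked up at query time, updating the corpus changes the context the model sees without any retraining, which is the property the Methods rely on.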
Results:
InfectA-Chat performed strongly in answering questions about infectious diseases, as measured by the GPT-4 evaluation method. Our comparative analysis revealed that it outperforms the AceGPT-7B-Chat and InfectA-Chat (based on AceGPT-7B) models by a margin of 48.5%. It also surpassed other Arabic LLMs such as AceGPT-13B-Chat and Jais-13B-Chat by 52.3%. Among state-of-the-art models, InfectA-Chat achieved a leading performance of 27.2%, competing closely with the GPT-4 model. Furthermore, the RAG method within InfectA-Chat significantly improves document retrieval accuracy. Notably, RAG retrieved more accurate documents for a query when the top-k parameter value was increased.
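The effect of the top-k parameter on retrieval can be made concrete with a small sketch. The abstract does not state how retrieval accuracy was measured; hit rate at k (the fraction of queries whose relevant document appears among the first k results) is assumed here, and the rankings and document IDs are invented for illustration.

```python
def hit_rate_at_k(rankings, relevant_ids, k):
    # Fraction of queries whose relevant document is within the top k results.
    hits = sum(
        1 for ranked, relevant in zip(rankings, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)

# Hypothetical retriever output for three queries (best match first).
rankings = [
    ["d1", "d2", "d3"],
    ["d3", "d1", "d2"],
    ["d2", "d3", "d1"],
]
# Hypothetical ground truth: the single relevant document for each query.
relevant_ids = ["d1", "d1", "d1"]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(rankings, relevant_ids, k))
```

Under this metric the score can never decrease as k grows, since a larger k only adds candidates. A larger k also passes more text to the model, so the choice of k remains a trade-off.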
Conclusions:
Our findings highlight the shortcomings of general Arabic LLMs in providing up-to-date information about infectious diseases. With this study, we aim to empower individuals and public health efforts by offering a bilingual Q&A system for infectious disease monitoring.
Clinical Trial: None declared.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.