Accepted for/Published in: JMIR AI
Date Submitted: Jun 27, 2025
Date Accepted: Oct 24, 2025
The Effectiveness of ChatGPT, Google Gemini, and Microsoft Copilot in Answering Thai Drug Information Queries: A Cross-sectional Study
ABSTRACT
Background:
Artificial intelligence (AI) chatbots, including ChatGPT-4o, Google Gemini, and Microsoft Copilot, are increasingly utilized to deliver healthcare-related information. Their potential to assist in pharmaceutical care and drug information services is gaining attention globally. However, their ability to provide accurate, complete, and safe drug-related information in non-English contexts, particularly in Thai, remains underexplored.
Objective:
This study aimed to evaluate the performance of these AI systems in responding to drug-related questions written in Thai.
Methods:
An analytical cross-sectional study was conducted using 76 public drug-related questions compiled from medical databases and social media sources between November 1st, 2019, and December 31st, 2024. These questions were categorized into 18 distinct types along with one mixed-type category, with each category comprising four questions (n=19 categories × 4 questions=76). The responses generated by ChatGPT-4o, Google Gemini, and Microsoft Copilot were evaluated in terms of correctness, completeness, risk, and reproducibility. All AI models were queried using identical input text in Thai, and responses were independently assessed by clinical pharmacists using standardized evaluation criteria.
Results:
ChatGPT-4o demonstrated a higher proportion of fully correct responses (50.0%) than Microsoft Copilot (35.5%) and Google Gemini (34.2%), although these differences did not reach statistical significance (P=.078). All three AI models provided generally complete responses, with no significant difference in completeness scores among them (P=.080). While high-risk answers were observed across all systems, the overall risk levels were not significantly different (P=.123). The category of drug-related question significantly influenced the correctness of AI responses (P=.002), but not their completeness (P=.230). ChatGPT-4o generally yielded the highest proportion of fully correct and complete answers across most categories; however, in the pharmacology category, Google Gemini and Microsoft Copilot outperformed ChatGPT-4o in correctness. Question type also significantly affected the risk level of the answers (P=.039); in particular, the pregnancy and lactation category showed the highest high-risk response rate (1.32% per system). Regarding reproducibility, all three AI models demonstrated consistent response patterns when the same questions were re-queried after 1, 7, and 14 days, with no significant deviation from the initial responses.
Conclusions:
The evaluated AI chatbots answered the queries with generally complete content; however, we found limited accuracy and occasional high-risk errors in their responses to drug-related questions in Thai. All models nevertheless exhibited good reproducibility, with consistent response patterns observed across multiple time points. Further improvements are necessary to provide safe, reliable, and language-specific pharmaceutical information.