ChatGPT vs. Scholars: A comparative examination of literature reviews conducted by humans and AI
ABSTRACT
Background:
With the rapid evolution of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT-4, there is increasing interest in their potential to assist in scholarly tasks, including conducting literature reviews. However, the efficacy of AI-generated reviews compared to traditional human-led approaches remains underexplored.
Objective:
This study aims to compare the quality of literature reviews conducted by OpenAI's GPT-4 model with those conducted by human researchers, focusing on the relational dynamics between physicians and patients.
Methods:
The study comprised two literature reviews on the same topic, namely exploring factors affecting relational dynamics between physicians and patients in medico-legal contexts. One review was conducted with OpenAI's GPT-4, whose training data had a knowledge cutoff of September 2021, and the other was conducted by human researchers. The human review involved a comprehensive literature search using medical subject headings and keywords in Ovid MEDLINE, followed by a thematic analysis to synthesize information from the selected articles. The AI-generated review employed a new prompt engineering approach, using iterative and sequential prompts to generate results, as illustrated in the sketch below. The comparative analysis was based on qualitative measures: accuracy, response time, consistency, breadth and depth of knowledge, contextual understanding, and transparency.
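The abstract does not disclose the authors' exact prompts or tooling; purely as an illustration, the following is a minimal sketch of what an iterative, sequential prompting workflow could look like using the OpenAI Python SDK. The model name, prompts, and parameters are hypothetical assumptions, not the study's actual protocol.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Sequential prompts: each step builds on the model's previous answer,
# which is kept in the running message history. Prompt wording is
# illustrative, not the study's actual protocol.
steps = [
    "List factors reported in the literature that affect relational"
    " dynamics between physicians and patients in medico-legal contexts.",
    "For each factor above, summarize the supporting evidence and note"
    " any conflicting findings.",
    "Organize the factors into themes and flag claims that may be"
    " outdated given your training data.",
]

messages = [{"role": "system",
             "content": "You assist with scholarly literature reviews."}]

for step in steps:
    messages.append({"role": "user", "content": step})
    response = client.chat.completions.create(
        model="gpt-4",    # illustrative; the exact model version is not given
        messages=messages,
        temperature=0,    # favor consistent, reproducible output
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)

Carrying the full conversation forward is what makes the prompting iterative: each new prompt operates on the model's prior output rather than starting from scratch.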
Results:
GPT-4 rapidly produced an extensive list of relational factors. The model demonstrated an impressive breadth of knowledge but exhibited limitations in depth and contextual understanding, occasionally producing irrelevant or incorrect information. In comparison, human researchers provided a more nuanced and contextually relevant review. Across the assessment criteria, GPT-4 showed advantages in response time and breadth of knowledge, while the human-led review excelled in accuracy, depth of knowledge, and contextual understanding.
Conclusions:
The study suggests that GPT-4, with structured prompt engineering, can be a valuable tool for conducting preliminary literature reviews by providing a broad overview of topics quickly. However, its limitations necessitate careful expert evaluation and refinement, making it an assistant rather than a substitute for human expertise in comprehensive literature reviews. Moreover, this research highlights the potential and limitations of using AI tools like GPT-4 in academic research, particularly in the fields of health services and medical research. It underscores the necessity of combining AI's rapid information retrieval capabilities with human expertise for more accurate and contextually rich scholarly outputs.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.