Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Metrics Used for the Evaluation of Chatbots Providing Cancer Genetic Risk Assessment and Education: A Systematic Review
ABSTRACT
Background:
Chatbots have recently emerged as an alternative approach for delivering cancer risk assessment and genetic counseling. Understanding the metrics used to describe the user-chatbot experience highlights the strengths and weaknesses of AI-assisted healthcare applications, ensuring safe and reliable medical care. While research supports chatbots in cancer genetic risk assessment and counseling, the evaluation measures remain inconsistent and unsystematic.
Objective:
This systematic review analyzes the metrics used to evaluate chatbot platforms providing cancer genetic risk assessment and pre-test and post-test genetic education. We examine these measures to identify potential limitations and inform a more systematic evaluative approach.
Methods:
A comprehensive search was conducted using three databases: PubMed, Web of Science, and Engineering Village. Articles were screened and analyzed using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework. Study and chatbot characteristics were documented, along with variables affecting metric use. Metrics evaluating the user-chatbot experience were extracted, categorized into domains, and organized within the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) framework to identify assessment gaps and insights regarding application and effectiveness.
Results:
The database search retrieved 684 citations, of which 11 articles met the inclusion criteria. The studies varied in methodology, research setting, chatbot functionality, and participant characteristics. A total of 104 measures were extracted and categorized into 16 groups, with each study using 2 to 22 metrics (median of 8). The measurement groups were organized into five domains: user experience, knowledge acquisition, outcomes and behaviors, emotional response, and technical performance, with user experience measures being the most common. Notably, despite the educational purpose of AI-assisted genetic counseling, the knowledge acquisition domain ranked only third in metric usage. Mapping the metrics to the RE-AIM framework showed that, collectively, they addressed all five dimensions, but it also revealed user-centric measures omitted from chatbot evaluations, including accuracy, transparency, data privacy, and educational continuity.
Conclusions:
The limited number of studies on automated cancer genetic risk assessment and education showed significant variability in the metrics used. A unified evaluation process is essential for accurately assessing chatbot effectiveness. Measures of the knowledge users gain hold important value, yet they are currently underrepresented. Expanding educational metrics will strengthen the informed consent process and empower patients in their healthcare decisions. Additionally, recognizing confounding variables and applying frameworks such as RE-AIM can help ensure that appropriate measures are properly implemented and not overlooked. These strategies will ultimately promote the safe and effective use of novel genetic services. Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.