Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 22, 2023
Date Accepted: Apr 24, 2024

The final, peer-reviewed published version of this preprint can be found here:

Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics

Meczner A, Cohen N, Qureshi A, Reza M, Blount E, Malak T

Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics

JMIR Form Res 2024;8:e49907

DOI: 10.2196/49907

PMID: 38820578

PMCID: 11179013

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Accuracy as a composite measure for the assessment of online symptom checkers in vignette studies: Evaluation of current practice and recommendations

  • Andras Meczner; 
  • Nathan Cohen; 
  • Aleem Qureshi; 
  • Maria Reza; 
  • Emily Blount; 
  • Tamer Malak

ABSTRACT

Background:

The rapid growth in online symptom checkers (OSCs) is not being matched by advances in quality assurance. Currently, there are no universally accepted criteria to assess the performance of OSCs. Vignette studies measuring accuracy as a composite metric have been widely used to evaluate OSCs. In contrast to clinical studies, vignette studies have a small number of testers. Hence, they may not be capable of providing a composite metric of performance, because testers vary in how they interpret symptoms and interact with OSCs.

Objective:

(1) To investigate whether crude accuracy is a reliable composite metric in vignette studies. (2) To investigate the feasibility of measuring isolated aspects of performance.

Methods:

Healthily’s OSC was assessed with 114 vignettes by three groups of three testers, who processed the vignettes under different instructions: (a) free interpretation of vignettes (free testers), (b) specified chief complaint(s) (partially free testers), and (c) specified chief complaint(s) with strict instructions for answering additional symptoms (restricted testers). Kappa statistics were calculated to assess agreement on the top outcome condition and the recommended triage. Consultations were reviewed to assess whether they were identical to the vignette, and accuracy was measured against a gold standard. A feasibility study for assessing symptom comprehension by OSCs was performed using 51 chief complaints with multiple variations across three OSCs.
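The abstract does not say which kappa variant was used. As a minimal sketch, assuming Fleiss' kappa (a common choice when more than two raters label each item), agreement within one group of three testers could be computed as below. The function name `fleiss_kappa` and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds one categorical label per tester
    for vignette i (for example, the top condition or the triage level).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every vignette needs the same number of testers")

    category_totals = Counter()
    observed_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of tester pairs that agree on this vignette.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed_sum += agreeing / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_items
    total_ratings = n_items * n_raters
    # Agreement expected by chance, from the overall label frequencies.
    p_expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical triage labels from three testers on four vignettes.
example = [
    ["A", "A", "A"],
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "B"],
]
kappa = fleiss_kappa(example)
```

Running the same function once per tester group, separately for condition labels and triage labels, would give one pair of values per group.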

Results:

Inter-tester agreement (kappa) for the most likely condition and triage, respectively, was 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. Accuracy of the individual restricted testers for the most likely condition ranged from 44.7% to 57.02%, with an average of 50.88%, increasing to 57.02% after review and selection of the most accurate consultations. Assessing comprehension accuracy was feasible for all three types of OSCs, which achieved an accuracy between 52% and 64%.
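As a rough cross-check of the accuracy figures, the sketch below shows how top-1 accuracy against a gold standard could be computed, and that the reported percentages are consistent with whole-number counts out of 114 vignettes. The counts 51, 58 and 65 are inferred from the percentages and are an assumption; the abstract does not state them. The helper names are hypothetical.

```python
N_VIGNETTES = 114  # number of vignettes reported in the Methods


def top1_accuracy(top_conditions, gold_standard):
    """Share of vignettes whose top outcome condition matches the gold standard."""
    if len(top_conditions) != len(gold_standard):
        raise ValueError("one outcome and one gold label are needed per vignette")
    correct = sum(1 for got, want in zip(top_conditions, gold_standard) if got == want)
    return correct / len(gold_standard)


def as_percent(correct, total=N_VIGNETTES):
    """Convert a count of correct vignettes to a percentage with two decimals."""
    return round(100 * correct / total, 2)


# Assumed counts (not stated in the abstract) that reproduce the reported values.
lowest = as_percent(51)   # lowest individual restricted tester
average = as_percent(58)  # average across restricted testers
highest = as_percent(65)  # highest tester, and the value after review and selection
```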

Conclusions:

Our study demonstrates linear improvement in agreement of outcome between testers with increasing standardisation of the testing process using vignettes. However, significant variability remains, reflected in varying crude accuracy. Reviewing cases revealed that certain aspects of symptom interpretation and the tester-OSC interaction are difficult to standardise. Therefore, crude accuracy is not an adequate composite measure for assessing OSCs in vignette studies with a small number of testers. This does not detract from the need for vignette studies, which have multiple advantages, but they should aim to measure isolated functions of an OSC. We recommend a review and selection process whereby the tester-OSC interaction is excluded from the measurement and outcome accuracy becomes a pure reflection of an OSC’s data/algorithm quality. In addition, we have demonstrated that symptom comprehension can be feasibly assessed as an isolated metric.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.