Currently submitted to: Journal of Medical Internet Research
Date Submitted: Mar 4, 2026
Open Peer Review Period: Mar 5, 2026 - Apr 30, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating the Quality of AI-Generated Outputs for Textual Analysis with GRACE (Grounded Review and Assessment of Computational Evidence): A Comparative Evaluation of Ten Methods Including Topic Modelling, Deep Learning, and Large Language Models
ABSTRACT
Background:
The rapid growth of digital technologies has generated large volumes of free-text data across healthcare, public health, and social research. These data contain contextualised accounts of lived experience that are often absent from quantitative measures. Despite their value, they remain underused because manual qualitative analysis is traditionally designed for in-depth work on small numbers of long transcripts and is difficult to scale. Computational methods, including topic modelling and large language models, are increasingly promoted as efficient solutions. However, concerns persist regarding interpretability, bias, hallucinations, and loss of contextual depth. Critically, there is no established human-centred framework for evaluating the quality of machine-generated outputs, despite qualitative research’s longstanding emphasis on reflexivity, nuance, and meaning-making.
Objective:
1) To develop an AI evaluation framework for assessing machine-generated outputs; and 2) to evaluate different machine learning approaches, including classic natural language processing (latent Dirichlet allocation, LDA), a deep learning method (BERTopic), and more recent generative AI (LLaMA-3, Copilot, DeepSeek).
Methods:
We developed and applied a human-centred evaluation framework, GRACE (Grounded Review and Assessment of Computational Evidence), to assess the quality of outputs generated by machine learning approaches from free-text data. GRACE was derived from established qualitative appraisal tools and operationalised four core indicators (interpretability, actionability, nuance, and redundancy) using structured scoring and reflexive consensus. We compared classic probabilistic topic modelling (LDA); a deep learning embedding-based approach (BERTopic); and three large language models (LLMs: LLaMA-3, Copilot, DeepSeek), used alone or in combination with prior structural topic modelling (STM). These were applied to the same corpus (n = 1,044 free-text responses). LLM prompting was iteratively refined, and a single-shot STM-based configuration was selected for final evaluation because it reduced hallucinations. All outputs were analysed within a Machine-Assisted Topic Analysis (MATA) workflow. A rapid manual thematic analysis of a 15% subsample (n = 152) served as a pragmatic comparator.
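For readers less familiar with these methods, the following is a minimal sketch of the comparative setup, assuming the gensim and bertopic Python packages, a naive tokeniser, and a placeholder corpus; the study's actual preprocessing, model parameters, LLM prompting, and MATA workflow are not specified in this abstract.

    # Minimal sketch: fitting LDA (gensim) and BERTopic to the same free-text
    # corpus, mirroring the comparative design described above. The corpus,
    # tokenisation, and parameters are illustrative assumptions only.
    from gensim import corpora
    from gensim.models import LdaModel
    from bertopic import BERTopic

    responses = [
        "Staff were kind but waiting times were far too long",
        "The online booking system kept crashing",
        "Nobody explained the results of my scan to me",
        "Parking at the clinic is expensive and hard to find",
    ] * 30  # placeholder corpus; the study analysed n = 1,044 real responses

    # Classic probabilistic topic modelling (LDA) on a bag-of-words corpus
    tokenised = [r.lower().split() for r in responses]        # naive tokenisation
    dictionary = corpora.Dictionary(tokenised)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=10, passes=10, random_state=42)  # topic count assumed
    for topic_id, words in lda.show_topics(num_words=8, formatted=False):
        print(topic_id, [word for word, _ in words])

    # Deep learning embedding-based clustering (BERTopic), default embeddings
    topic_model = BERTopic(min_topic_size=10)
    topics, probabilities = topic_model.fit_transform(responses)
    print(topic_model.get_topic_info().head())

In practice, both models receive the same corpus but in different representations: LDA consumes word counts, whereas BERTopic clusters sentence embeddings of the raw responses, which is one reason the two approaches can surface different structures from identical data.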
Results:
Model outputs were variable, with different natural language processing (NLP) methods producing different results from the same dataset. GRACE evaluation indicated that LDA achieved the highest overall mean score (2.6/5), followed by BERTopic and topic modelling plus Copilot (both 2.5), topic modelling plus LLaMA-3 (2.2), and topic modelling plus DeepSeek (1.9). LDA generated broader conceptual patterns requiring interpretive refinement, while BERTopic produced narrower, more descriptive clusters with thematic overlap. LLM-only outputs were very poor; combining topic modelling with LLMs performed better, producing outputs that were well structured but often superficial and repetitive.
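To make the score aggregation concrete, here is a minimal sketch of how an overall GRACE mean could be computed per method from indicator-level ratings; the two-rater setup and all rating values are hypothetical illustrations, not the study's data, and the exact rubric (including how redundancy is scored) is not given in this abstract.

    # Hypothetical GRACE scoring sheet: two raters score each method's output
    # on the four indicators (1-5 scale assumed). Values are illustrative only.
    from statistics import mean

    INDICATORS = ["interpretability", "actionability", "nuance", "redundancy"]

    ratings = {
        "LDA":          {"interpretability": [3, 3], "actionability": [2, 3],
                         "nuance": [3, 2], "redundancy": [3, 2]},
        "TM + Copilot": {"interpretability": [3, 2], "actionability": [3, 2],
                         "nuance": [2, 2], "redundancy": [3, 3]},
    }

    def grace_overall(method_ratings: dict) -> float:
        """Overall score: mean of the four indicator means across raters."""
        return mean(mean(method_ratings[ind]) for ind in INDICATORS)

    for method, scores in ratings.items():
        print(f"{method}: {grace_overall(scores):.1f}/5")  # e.g. LDA: 2.6/5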
Conclusions:
Computational models produced different interpretations of the same dataset, and performance did not align with technical complexity. Large language models were not suitable for thematic analysis, especially when applied to raw data, as they generated generalised and sometimes inaccurate outputs. Classical probabilistic modelling, particularly STM within the MATA workflow, provided the most reliable foundation, but still required human interpretation. We argue that the key issue is not whether a model “works,” but what insights it produces and whether these support meaningful, contextually grounded conclusions. GRACE offers a simple, human-centred framework to support this assessment. We recommend the use of a structured MATA approach.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.