Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 23, 2024
Date Accepted: Nov 24, 2024
ChatGPT for Data Analysis: Your Statistics Doula
ABSTRACT
Background:
Before the release of Code Interpreter, ChatGPT integrated knowledge archived on the World Wide Web. Since its release, Code Interpreter has added data analysis capability to ChatGPT. The associated analytical tools could democratize access to statistical analysis for all researchers.
Objective:
The goal of this study is to provide researchers with a framework for applying ChatGPT to data management tasks, descriptive statistics, and inferential statistics.
Methods:
A subset of the National Inpatient Sample was extracted. Data analysis trials were divided into data processing, categorization, and tabulation, as well as descriptive and inferential statistics. For the data processing, categorization, and tabulation assessments, ChatGPT was prompted to reclassify variables, subset variables, and present data, respectively. Descriptive statistics assessments included mean, standard deviation, median, and interquartile range calculations. Inferential statistics assessments were conducted at varying levels of prompt specificity and included the chi-square test, Pearson correlation, independent two-sample t-test, one-way ANOVA, Fisher’s exact test, Spearman correlation, Mann-Whitney U test, and Kruskal-Wallis H test. Outcomes from consecutive prompt-based trials were assessed against expected statistical values calculated in SAS and RStudio.
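The descriptive statistics step and the comparison against reference values could be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the study's actual procedure: the variable name `length_of_stay`, the toy values, the helper functions `describe` and `matches_reference`, and the tolerance are all hypothetical. Quantile definitions differ between statistical packages, so the `inclusive` method used here is also an assumption.

```python
import math
import statistics

# Hypothetical toy values standing in for one numeric column; not study data.
length_of_stay = [2, 3, 3, 4, 5, 7, 8, 12]


def describe(values):
    """Return mean, sample standard deviation, median, and quartile bounds."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def matches_reference(observed, expected, rel_tol=1e-6):
    """True when every reference statistic is reproduced within rel_tol."""
    return all(
        math.isclose(observed[key], expected[key], rel_tol=rel_tol)
        for key in expected
    )


summary = describe(length_of_stay)
```

In this sketch, a trial would count as accurate when `matches_reference(summary, expected)` returns `True` for reference values computed independently in another package.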
Results:
ChatGPT accurately performed data processing, categorization, and tabulation across all trials. For descriptive statistics, it provided accurate means, standard deviations, medians, and interquartile ranges across all trials. Inferential statistics accuracy against expected statistical values varied with prompt specificity: “Basic” prompts achieved 46.3% accuracy, “Intermediate” prompts 85.0%, and “Advanced” prompts 92.5%.
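A per-tier accuracy figure of this kind could be tallied along the following lines. This is only a sketch under stated assumptions: the trial log, its layout, and the function `accuracy_by_tier` are hypothetical, and the rows are illustrative placeholders that do not reproduce the study's data or its percentages.

```python
from collections import defaultdict

# Hypothetical trial log: (prompt tier, output matched the reference value).
# Illustrative rows only; these are not the study's results.
trials = [
    ("Basic", True), ("Basic", False), ("Basic", False), ("Basic", True),
    ("Intermediate", True), ("Intermediate", True),
    ("Intermediate", False), ("Intermediate", True),
    ("Advanced", True), ("Advanced", True),
    ("Advanced", True), ("Advanced", True),
]


def accuracy_by_tier(rows):
    """Return the percentage of matching trials for each prompt tier."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [matches, total]
    for tier, matched in rows:
        counts[tier][0] += int(matched)
        counts[tier][1] += 1
    return {
        tier: 100.0 * hits / total
        for tier, (hits, total) in counts.items()
    }


accuracy = accuracy_by_tier(trials)
```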
Conclusions:
ChatGPT shows promise as a tool for exploratory data analysis, particularly for researchers with some statistical knowledge and limited programming expertise. However, its application requires careful prompt construction and human oversight to ensure accuracy. As a supplementary tool, ChatGPT can enhance data analysis efficiency and broaden research accessibility.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.