Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 12, 2024
Date Accepted: Nov 17, 2024
The social construction of categorical data: A mixed-methods approach to assessing data features in publicly available machine learning datasets
ABSTRACT
Background:
In data-scarce areas such as healthcare, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. They therefore increasingly work with models that fuse multiple types and categories of data to create predictive outputs, for example, medical images combined with metadata such as text-based medical information. However, the effects of including metadata features in model training in such data-scarce areas are underexamined, particularly for models intended to serve individuals equitably in a diverse population.
Objective:
This study proposes a mixed-methods approach to examining the impact of metadata categories on machine learning training in medical imaging.
Methods:
Our approach combines quantitative and qualitative analysis. As an example, we apply it to a Brazilian dermatological dataset (PAD-UFES 20). We present an exploratory quantitative study that assesses the effects of including or excluding each of the unique metadata categories of the PAD-UFES 20 dataset when training a transformer-based model using a data fusion algorithm. We pair this quantitative analysis with a qualitative examination of the metadata categories based on an interview with the dataset author.
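The include/exclude comparison described in the Methods can be pictured as a leave-one-category-out ablation. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the names `ablation_configs`, `per_class_effects`, `toy_evaluate`, `category_a`, `category_b`, `class_x`, and `class_y` are hypothetical, and `toy_evaluate` returns made-up numbers in place of actually training a transformer-based fusion model.

```python
from typing import Callable, Dict, List, Sequence

PerClassScores = Dict[str, float]


def ablation_configs(categories: Sequence[str]) -> Dict[str, List[str]]:
    """Return the full feature set plus one set per left-out category."""
    configs = {"all": list(categories)}
    for left_out in categories:
        configs[f"without_{left_out}"] = [c for c in categories if c != left_out]
    return configs


def per_class_effects(
    categories: Sequence[str],
    train_and_evaluate: Callable[[List[str]], PerClassScores],
) -> Dict[str, PerClassScores]:
    """For each category, report the per-class score change from including it."""
    configs = ablation_configs(categories)
    baseline = train_and_evaluate(configs["all"])
    effects: Dict[str, PerClassScores] = {}
    for left_out in categories:
        scores = train_and_evaluate(configs[f"without_{left_out}"])
        # Positive value: the class scored higher with the category included.
        effects[left_out] = {cls: baseline[cls] - scores[cls] for cls in baseline}
    return effects


def toy_evaluate(features: List[str]) -> PerClassScores:
    """Hypothetical stand-in for model training; the numbers are invented."""
    scores = {"class_x": 0.5, "class_y": 0.5}
    if "category_a" in features:
        scores["class_x"] += 0.25   # helps one class ...
        scores["class_y"] -= 0.125  # ... while hurting another
    if "category_b" in features:
        scores["class_y"] += 0.125
    return scores


effects = per_class_effects(["category_a", "category_b"], toy_evaluate)
for category, deltas in effects.items():
    print(category, deltas)
```

Reporting the difference per predictive class, rather than as one aggregate score, is what lets a category that helps one class while hurting another show up in the comparison.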
Results:
Our study shows that the effects of including different metadata categories in training can vary widely across the predictive classes of a medical imaging model. Our findings highlight the social constructedness of certain metadata categories in publicly available datasets, meaning that the data in a category depends heavily on both how the category is defined by the dataset creators and the sociomedical context in which the data is collected. This social constructedness creates a context dependency and thus impairs the transferability of some metadata categories to new models, revealing the limits of their applicability to different contexts.
Conclusions:
We conclude that social scientific, context-dependent analysis of available metadata features with both quantitative and qualitative methods is helpful in judging the utility of metadata categories for the population for which a model is intended.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.