JMIR Preprints #59452: The social construction of categorical data: A mixed-methods approach to assessing data features in publicly available machine learning datasets


The social construction of categorical data: A mixed-methods approach to assessing data features in publicly available machine learning datasets

Theresa Willem;
Alessandro Wollek;
Theodor Cheslerean-Boghiu;
Martha Kenney;
Alena Buyx

ABSTRACT

Background:

In data-scarce areas such as healthcare, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models’ outputs. They therefore increasingly work with models that fuse multiple types and categories of data to create predictive outputs, for example, medical images combined with metadata such as text-based medical information. However, the effects of including metadata features in model training in such data-scarce areas are underexamined, particularly for models intended to serve individuals equitably in a diverse population.

Objective:

This study proposes a mixed methods approach to examining the impacts of metadata categories for machine learning training in medical imaging.

Methods:

Our approach combines quantitative and qualitative analysis. As an example, we apply it to a Brazilian dermatological dataset (PAD-UFES 20). We present an exploratory quantitative study that assesses the effects of including or excluding each of the unique metadata categories of the PAD-UFES 20 dataset when training a transformer-based model using a data fusion algorithm. We pair this quantitative analysis with a qualitative examination of the metadata categories based on an interview with the dataset author.
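The include/exclude comparison described above can be pictured as a leave-one-out ablation over metadata categories. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `ablation_configs` and `per_class_effects` and the `evaluate` callback (which is assumed to train and score a model on a given set of metadata categories and return one score per predictive class) are hypothetical and do not appear in the manuscript.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical callback: given the metadata categories to use, train and
# evaluate a model, then return one score per predictive class.
Evaluate = Callable[[Sequence[str]], Dict[str, float]]


def ablation_configs(categories: Sequence[str]) -> Dict[str, List[str]]:
    """Return the full category set plus one leave-one-out set per category."""
    configs = {"all": list(categories)}
    for excluded in categories:
        configs[f"without_{excluded}"] = [c for c in categories if c != excluded]
    return configs


def per_class_effects(
    categories: Sequence[str], evaluate: Evaluate
) -> Dict[str, Dict[str, float]]:
    """For each category, compute (score with it) - (score without it) per class.

    A positive value means the class scored higher when the category was included.
    """
    baseline = evaluate(list(categories))
    effects: Dict[str, Dict[str, float]] = {}
    for excluded in categories:
        kept = [c for c in categories if c != excluded]
        ablated = evaluate(kept)
        effects[excluded] = {cls: baseline[cls] - ablated[cls] for cls in baseline}
    return effects
```

Reporting the differences per class rather than as one aggregate number keeps visible the case where a category helps some classes and hurts others.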

Results:

Our study shows how widely the effects of including different metadata categories in training can vary across the predictive classes of a medical imaging model. Our findings highlight the social constructedness of certain metadata categories in publicly available datasets: the data in a category depends heavily on both how the dataset creators define that category and the sociomedical context in which the data is collected. This social constructedness creates a context dependency that impairs the transferability of some metadata categories to new models and limits their applicability to different contexts.

Conclusions:

We conclude that social scientific, context-dependent analysis of available metadata features with both quantitative and qualitative methods is helpful in judging the utility of metadata categories for the population for which a model is intended.

Citation

Please cite as:

Willem T, Wollek A, Cheslerean-Boghiu T, Kenney M, Buyx A

The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets

JMIR Med Inform 2025;13:e59452

DOI: 10.2196/59452

PMID: 39874567

PMCID: 11815297

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 12, 2024

Date Accepted: Nov 17, 2024
