JMIR Preprints #39235: Issues with Variability in EHR Data About Race and Ethnicity: A Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave

Issues with Variability in EHR Data About Race and Ethnicity: A Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave

Lily Cook;
Juan Espinoza;
Nicole G. Weiskopf;
Nisha Mathews;
David A. Dorr;
Kelly L. Gonzales;
Adam Wilcox;
Charisse Madlock-Brown;
on behalf of the N3C Consortium

ABSTRACT

Background:

The adverse impact of COVID-19 on marginalized and under-resourced communities of color has highlighted the need for accurate, comprehensive race and ethnicity data. However, a significant technical challenge in integrating race and ethnicity data into large, consolidated databases is the lack of consistency in how these data are collected and structured by healthcare organizations.

Objective:

To evaluate and describe variations in how healthcare systems collect and report information about the race and ethnicity of their patients, and to assess how well these data are integrated when aggregated into a large clinical database.

Methods:

At the time of our analysis, the National COVID Cohort Collaborative (N3C) Data Enclave contained records from 6.5 million patients contributed by 56 healthcare institutions. We quantified the variability in the harmonized race and ethnicity data in the N3C Enclave by analyzing its conformance to healthcare standards for such data. We conducted a descriptive analysis by comparing the harmonized data available for research purposes in the database to the original source data contributed by healthcare institutions. To make the comparison, we tabulated the original source codes, enumerating how many patients had been reported with each encoded value and how many distinct ways each category was reported. The non-conforming data were also cross-tabulated by three factors: patient ethnicity, the number of data partners using each code, and which data models used those particular encodings. For the non-conforming data, we used an inductive approach to sort the source encodings into categories. For example, values such as “Declined” were grouped with “Refused”, and “Multiple Race” was grouped with “Two or more races” and “Multiracial”.
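
The tabulation and grouping steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the sample records, the `CONFORMING` set, the group labels in `GROUPS`, and the function name `tabulate_nonconforming` are all hypothetical, and only the two example groupings given in the abstract are reproduced. The cross-tabulation by data model is omitted for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (data partner, source race value, ethnicity).
# These rows are invented for illustration and are not N3C data.
RECORDS = [
    ("site_a", "White", "Not Hispanic or Latino"),
    ("site_a", "Declined", "Hispanic or Latino"),
    ("site_b", "Refused", "Hispanic or Latino"),
    ("site_b", "Multiple Race", "Not Hispanic or Latino"),
    ("site_c", "Multiracial", "Hispanic or Latino"),
    ("site_c", "Two or more races", "Not Hispanic or Latino"),
    ("site_c", "declined", "Not Hispanic or Latino"),
]

# Assumed stand-in for the values that conform to the race standard.
CONFORMING = {
    "white",
    "black or african american",
    "asian",
    "american indian or alaska native",
    "native hawaiian or other pacific islander",
}

# Groupings taken from the examples in the abstract; the labels are assumed.
GROUPS = {
    "declined": "Refused",
    "refused": "Refused",
    "multiple race": "Multiracial",
    "two or more races": "Multiracial",
    "multiracial": "Multiracial",
}


def tabulate_nonconforming(records):
    """Count patients and distinct encodings for each non-conforming group."""
    patients_per_group = Counter()          # group -> number of patients
    encodings_per_group = defaultdict(set)  # group -> distinct source values
    partners_per_code = defaultdict(set)    # source value -> data partners
    by_ethnicity = Counter()                # (group, ethnicity) -> patients
    for partner, race, ethnicity in records:
        key = race.strip().lower()
        if key in CONFORMING:
            continue  # conforming values are outside this tabulation
        group = GROUPS.get(key, "Ungrouped")
        patients_per_group[group] += 1
        encodings_per_group[group].add(race)
        partners_per_code[race].add(partner)
        by_ethnicity[(group, ethnicity)] += 1
    return patients_per_group, encodings_per_group, partners_per_code, by_ethnicity


patients, encodings, partners, by_ethnicity = tabulate_nonconforming(RECORDS)
for group, count in patients.most_common():
    print(group, count, sorted(encodings[group]))
```

Keeping the raw source value as the key for the data partner tally, while counting patients by the grouped category, mirrors the distinction the methods draw between how many patients fall in a category and how many distinct ways that category was reported.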

Results:

“No matching category” was the second-largest harmonized racial group in the N3C. Overall, 20.7% of the race data did not conform to the standard; the largest category of non-conforming data was missing data. Hispanic or Latino patients were over-represented in the non-conforming racial data, and data from American Indian or Alaska Native patients were obscured. Although only a small proportion of the source data (0.6%) had not been mapped to the correct concepts, Black or African-American and Hispanic/Latino patients were over-represented in this category.

Conclusions:

Differences in how race and ethnicity data are conceptualized and encoded by healthcare institutions can affect the quality of the data in aggregated clinical databases. The impact of data quality issues in the N3C Data Enclave was not equal across all races and ethnicities, which has the potential to introduce bias in analyses and conclusions drawn from these data. Transparency about how data have been transformed can help users make accurate analyses and inferences, and eventually better guide clinical care and public policy.

Citation

Please cite as:

Cook L, Espinoza J, Weiskopf NG, Mathews N, Dorr DA, Gonzales KL, Wilcox A, Madlock-Brown C, on behalf of the N3C Consortium

Issues With Variability in Electronic Health Record Data About Race and Ethnicity: Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave

JMIR Med Inform 2022;10(9):e39235

DOI: 10.2196/39235

PMID: 35917481

PMCID: 9490543

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 9, 2022

Open Peer Review Period: May 3, 2022 - May 31, 2022

Date Accepted: Jul 26, 2022

Date Submitted to PubMed: Aug 2, 2022
