Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 27, 2025
Date Accepted: Oct 23, 2025
Balancing Privacy and Utility in Child and Adolescent Mental Health Services Research: A Retrospective Cohort Study on Synthetic Data Generation
ABSTRACT
Background:
High-quality, large-scale healthcare research, especially research using medical records, faces significant technical and confidentiality challenges. As a result, critical research questions about patient evaluation and treatment remain unanswered. Moreover, the stigma and heightened sensitivity surrounding mental health have significantly delayed research progress, particularly in Child and Adolescent Mental Health Services (CAMHS).
Objective:
To generate synthetic CAMHS data from real-world CAMHS records using a hierarchical synthetic data generation model, and to evaluate the utility, quality, and privacy risk of the generated data.
Methods:
A CAMHS dataset from Stavanger University Hospital in Norway was divided into two cohorts: a training cohort (n = 6,184 referrals, 58,524 episodes of care) and an independent test set (n = 1,564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to create synthetic referral periods and associated episodes of care based on real-world CAMHS data. The utility, quality, and privacy risk of the generated synthetic data were then evaluated and reported.
Results:
The study used a CAMHS cohort of 6,924 patients from Stavanger University Hospital, Norway. A synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties similar to real-world data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75, CSS score = 0.92), while demonstrating low privacy risk (average singling-out score = 0.17 univariate and 0.04 multivariate, linkability risk = 2.5, inference risk = 0.7). The predictive model trained on synthetic data performed comparably to the model trained on real data for classifying the intensity of care required by patients, while maintaining feature interpretability (for n = 656, 1,546, 3,092, and 6,184, average PR_AUC = 0.32, 0.33, 0.34, and 0.40, respectively, compared with PR_AUC = 0.43 using 6,184 real data records).
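The abstract does not say how the KS/TVD Complement score is computed. As a minimal sketch, assuming the common definition in synthetic-data evaluation (1 minus the two-sample Kolmogorov-Smirnov statistic for numeric columns, 1 minus the total variation distance for categorical columns), the per-column scores could be calculated as follows. The function names and sample values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """Return 1 - KS statistic for one numeric column (assumed definition)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tvd_complement(real, synthetic):
    """Return 1 - total variation distance for one categorical column."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories missing from one of the samples.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


if __name__ == "__main__":
    # Hypothetical toy columns, for illustration only.
    print(ks_complement([1, 2, 3, 4], [1, 2, 3, 5]))
    print(tvd_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Both scores range from 0 to 1, with 1 meaning the real and synthetic marginal distributions match exactly. An overall score would typically be an average over columns, although the aggregation used in the study is not stated here.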
Conclusions:
By offering access to extensive and representative samples with a low risk of patient identification, synthetic CAMHS data balances data utility with fairness and privacy protection. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Additionally, it fosters innovation by providing researchers with high-quality data that can be used to develop new treatments and interventions. Furthermore, the use of synthetic data can help overcome barriers related to data access and regulatory constraints, making it easier for researchers to collaborate and share findings across institutions.