Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 31, 2020
Date Accepted: Jan 23, 2021
A Framework for Hierarchical Annotation of Unstructured Electronic Health Records and Integration into Standardized Medical Database: SOCRATex
ABSTRACT
Background:
Although electronic health records (EHRs) are widely used for secondary purposes, clinical documents remain relatively underutilized because standardized clinical text frameworks are lacking across institutions.
Objective:
To develop a framework for processing unstructured clinical documents of EHRs and integrating them with standardized structured data.
Methods:
We developed a framework known as Staged Optimization of Curation, Regularization, and Annotation of clinical text (SOCRATex). SOCRATex has four components: (1) clinical document extraction and preprocessing using text mining methods, (2) definition of an annotation schema with a hierarchical structure, (3) document-level annotation using the annotation schema, and (4) indexing of the annotations for a search engine system. To test the usability of the proposed framework, proof-of-concept studies were performed on EHRs. We defined three distinct patient groups and extracted their clinical documents (i.e., pathology reports, radiology reports, and admission notes). The documents were annotated and integrated into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) database. The annotations were then used in Cox proportional hazards models across three clinical analyses, measuring (1) all-cause mortality, (2) thyroid cancer recurrence, and (3) 30-day hospital readmission.
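As a rough illustration of how a hierarchical, document-level annotation might be flattened and indexed for search, the following minimal Python sketch may help. It is not the SOCRATex implementation: the field names (`note_id`, `tumor`, `lymph_node`, and so on), the example values, and the in-memory index standing in for a search engine are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical document-level annotation for one pathology report.
# The schema and all values below are illustrative assumptions,
# not the annotation schema used in the study.
annotation = {
    "note_id": 1001,
    "person_id": 42,
    "note_type": "pathology report",
    "annotation": {
        "tumor": {
            "site": "colon",
            "lymphovascular_invasion": True,
        },
        "lymph_node": {
            "metastasis": True,
            "positive_count": 3,
        },
    },
}


def flatten(node, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def build_index(documents):
    """Map each (path, value) pair to the set of note_ids that contain it."""
    index = defaultdict(set)
    for doc in documents:
        for path, value in flatten(doc["annotation"]):
            index[(path, value)].add(doc["note_id"])
    return index


documents = [annotation]
index = build_index(documents)

# Look up every note annotated with lymph node metastasis.
hits = sorted(index[("lymph_node.metastasis", True)])
print(hits)
```

Flattening the hierarchy into dotted paths keeps the parent-child relationships queryable while letting each leaf be matched exactly; a production search engine would typically index the nested document directly instead.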
Results:
Overall, 1,055 clinical documents from 923 patients were extracted and annotated using the defined annotation schemas. The generated annotations were indexed into an unstructured textual data repository. Using the annotations of pathology reports, we identified that node metastasis and lymphovascular invasion of the tumor were associated with all-cause mortality in patients with colon and rectum cancer (all P<0.05). The other analyses, measuring thyroid cancer recurrence using radiology reports and 30-day hospital readmission using admission notes of patients with depressive disorder, also showed results consistent with existing knowledge.
Conclusions:
We propose a framework for hierarchical annotation of textual data and integration into the standardized OMOP-CDM medical database. The proof-of-concept studies demonstrated that our framework can effectively process and integrate diverse clinical documents with standardized structured data for clinical research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.