Currently submitted to: JMIR Medical Informatics
Date Submitted: Jun 2, 2026
Open Peer Review Period: Jun 17, 2026 - Aug 12, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Data Readiness of Inpatient Discharge Abstracts for Pulmonary Tuberculosis With Hemoptysis: A Retrospective Computable Phenotyping and Label Validation Study
ABSTRACT
Background:
Pulmonary tuberculosis with hemoptysis is clinically heterogeneous and may involve active respiratory tuberculosis, post-tuberculosis structural lung disease, secondary infection, comorbidity-related treatment constraints, and interventional care pathways. Routinely collected inpatient discharge abstracts are scalable but were not designed for granular hemoptysis research.
Objective:
This study evaluated whether inpatient discharge abstracts can support cohort indexing, computable first-layer phenotype labels, operational escalation-event labels, and data-readiness assessment for hospitalized pulmonary tuberculosis with hemoptysis, while identifying downstream research tasks that require linkage to richer electronic health record data.
Methods:
We conducted a single-center retrospective data-readiness, computable label-generation, and label-validation study using inpatient discharge abstracts from Shenzhen Third People's Hospital from January 1, 2010, to April 30, 2026. The primary unit of analysis was hospitalization. Hospitalizations were eligible when the source discharge-abstract query identified a tuberculosis-related diagnosis together with hemoptysis or ICD code R04.2. Non-mutually exclusive labels were generated from diagnosis fields, procedure names, rescue records, transfusion fields or costs, and discharge-status outcomes. A stratified sample of 200 hospitalization-level records from the extracted cohort was used for validation of selected labels, with oversampling of escalation-related labels. Positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, and Cohen's kappa were calculated where denominators were estimable. Exploratory multivariable logistic regression was used only as a first-layer label co-occurrence analysis, with robust standard errors clustered by medical record number.
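The eligibility rule and the hospitalization-level unit of analysis described above can be sketched as follows. This is a minimal illustration, not the authors' source query. The `Hospitalization` record, its field names, the use of A15/A16 prefixes as the "tuberculosis-related diagnosis" test, and the free-text match on "hemoptysis" are all assumptions made for the example. Only the R04.2 code and the clustering of hospitalizations by medical record number come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: tuberculosis-related diagnoses are approximated by ICD-10 A15/A16
# prefixes; the actual source query may use a broader definition.
TB_PREFIXES = ("A15", "A16")
HEMOPTYSIS_CODE = "R04.2"  # ICD code named in the Methods


@dataclass
class Hospitalization:
    """Hypothetical discharge-abstract record (field names are illustrative)."""
    mrn: str
    diagnosis_codes: List[str]
    diagnosis_text: List[str] = field(default_factory=list)


def is_eligible(h: Hospitalization) -> bool:
    """Tuberculosis-related diagnosis AND (hemoptysis text OR code R04.2)."""
    codes = [c.strip().upper() for c in h.diagnosis_codes]
    has_tb = any(c.startswith(TB_PREFIXES) for c in codes)
    has_code = any(c.startswith(HEMOPTYSIS_CODE) for c in codes)
    has_text = any("hemoptysis" in t.lower() for t in h.diagnosis_text)
    return has_tb and (has_code or has_text)


def count_units(cohort: List[Hospitalization]) -> Tuple[int, int, int]:
    """Return (hospitalizations, unique MRNs, MRNs with repeated stays)."""
    per_mrn = Counter(h.mrn for h in cohort)
    repeated = sum(1 for n in per_mrn.values() if n > 1)
    return len(cohort), len(per_mrn), repeated


records = [
    Hospitalization("M1", ["A15.0", "R04.2"]),
    Hospitalization("M1", ["A16.2"], ["Hemoptysis"]),
    Hospitalization("M2", ["J18.9", "R04.2"]),   # no tuberculosis diagnosis
    Hospitalization("M3", ["A15.3"]),            # no hemoptysis
]
cohort = [r for r in records if is_eligible(r)]
summary = count_units(cohort)
```

Counting repeated medical record numbers separately from hospitalizations matters because the same patient can contribute several episodes, which is why the regression standard errors are clustered by medical record number.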
Results:
The cohort included 8443 hospitalizations corresponding to 7161 unique medical record numbers; 791 medical record numbers had repeated hospitalizations. Male sex accounted for 6201 hospitalizations (73.4%), median age was 41.0 years (IQR 27.0-56.0), and median length of stay was 10.0 days (IQR 7.0-14.0). A15/A16 respiratory tuberculosis was the primary diagnosis in 6314 hospitalizations (74.8%). Discharge-abstract-derived labels identified post-tuberculosis or structural lung disease in 2622 hospitalizations (31.1%), pulmonary infection-inflammation in 3033 (35.9%), and a diagnosis-derived systemic-complexity flag in 3352 (39.7%). The composite escalation-event label occurred in 1555 hospitalizations (18.4%). Within the extracted validation sample, cohort inclusion was confirmed in all 200 reviewed records. The composite escalation-event label had 120 true positives and 80 true negatives, with PPV, NPV, sensitivity, and specificity all equal to 1.000 within the sampled validation set. Pulmonary infection-inflammation had 1 false negative, with sensitivity 0.988 and kappa 0.990. The diagnosis-derived systemic-complexity flag had 92 evaluable records after exclusion of unclear classifications, highlighting the need for cautious interpretation of diagnosis-derived comorbidity timing. Because the validation sample was stratified and enriched for escalation-related labels, validation estimates should be interpreted as label-level operational agreement within the sampled cohort rather than as hospital-wide source-query performance.
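The agreement statistics reported above can be reproduced from a 2x2 table of label versus chart review. The sketch below uses the standard definitions. The function name and dictionary keys are illustrative, not taken from the study. The escalation-event counts (120 true positives, 80 true negatives, no disagreements) follow from the reported results. For pulmonary infection-inflammation the abstract reports only the single false negative, so the remaining cell counts are an assumed table chosen because it is consistent with the reported sensitivity and kappa.

```python
from typing import Dict, Optional


def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """PPV, NPV, sensitivity, specificity, and Cohen's kappa from a 2x2 table.

    A metric is None when its denominator is zero (not estimable).
    """
    def ratio(num: float, den: float) -> Optional[float]:
        return num / den if den else None

    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("empty 2x2 table")

    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of label and reference.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)

    return {
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "kappa": ratio(observed - expected, 1 - expected),
    }


# Composite escalation-event label: counts implied by the reported results.
escalation = validation_metrics(tp=120, fp=0, fn=0, tn=80)

# Pulmonary infection-inflammation: only fn=1 is reported; tp, fp, and tn are
# ASSUMED values (not from the abstract) that match the reported statistics.
infection = validation_metrics(tp=82, fp=0, fn=1, tn=117)
```

Returning `None` for a zero denominator mirrors the abstract's wording that metrics were calculated only "where denominators were estimable". Because the sample was enriched for escalation-related labels, these figures describe agreement within the sample and would need reweighting by sampling fractions before being read as cohort-wide estimates.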
Conclusions:
In this specialized hospital setting, inpatient discharge abstracts could be reused as a first-layer indexing and computable label-generation source for hospitalized pulmonary tuberculosis with hemoptysis. However, discharge abstracts alone were not data ready for standardized hemoptysis severity grading, hemostatic treatment-effectiveness evaluation, or multimodal risk prediction. Their primary value lies in constructing eligible hospitalization episodes, generating transparent first-layer labels, and defining data-linkage requirements for medication, nursing, physiologic, laboratory, imaging, interventional, and longitudinal follow-up data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.