Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 4, 2026
Open Peer Review Period: Mar 16, 2026 - May 11, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Sequential Disease Pattern Mining for Early Risk Signal Detection in Population Health Data: A Nationwide Retrospective Cohort Study
ABSTRACT
Background:
Many diseases develop through sequences of related conditions over time. Identifying the order in which diagnoses occur may help detect early risk signals before severe outcomes arise. However, knowledge of clinically significant temporal patterns remains limited because many population-level studies focus on disease co-occurrence rather than the temporal order of diagnoses.
Objective:
To identify and validate temporal disease associations using frequent pattern mining and statistical validation techniques in a nationwide patient-record database.
Methods:
We analyzed the health records of 3,987,382 Finnish patients. The records were transformed into temporal disease sequences using the date on which the first record was entered into the database. To identify the most common patterns for each unique disease, we applied the FP-Growth algorithm with a support threshold of 5% or a minimum of 5 patients. To validate each pattern, we combined the relative risk (RR), its 95% confidence interval (CI), and the relative width as a measure of precision.
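The validation step can be illustrated with a minimal sketch. It assumes the standard log-scale (Katz) confidence interval for a relative risk and defines relative width as the interval width divided by the RR; the abstract does not state the exact formulas used, so both are assumptions. The function name and the counts in the usage lines are hypothetical.

```python
import math


def relative_risk_with_ci(exposed_cases, exposed_total,
                          unexposed_cases, unexposed_total, z=1.96):
    """Return (rr, lower, upper, relative_width) for a 2x2 cohort table.

    Assumptions (not taken from the paper): the CI is the log-scale
    (Katz) interval, and relative width = (upper - lower) / rr.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)

    # Smaller values indicate a more precise estimate.
    relative_width = (upper - lower) / rr
    return rr, lower, upper, relative_width


# Hypothetical counts: 30 of 100 patients with the preceding diagnosis
# develop the outcome, versus 10 of 100 patients without it.
rr, lower, upper, relative_width = relative_risk_with_ci(30, 100, 10, 100)
print(f"RR={rr:.2f}, 95% CI=({lower:.2f}, {upper:.2f}), "
      f"relative width={relative_width:.2f}")
```

Under this definition, a pattern with a high RR but a wide interval receives a large relative width, so a precision filter would discard it.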
Results:
We identified several clinically interpretable temporal disease connections. For example, acute kidney failure was most often preceded by chronic kidney disease (RR = 15.13) and sepsis (RR = 9.76). It was also grouped with heart failure-related combinations (RR = 10.76). Patients with a diabetic foot ulcer and type 1 or type 2 diabetes had a substantially elevated risk of osteomyelitis (RR = 157.02 for type 1 and RR = 84.84 for type 2). At the block level, cerebrovascular diseases were linked to hypertension (RR = 2.47), atherosclerosis (RR = 2.52), and dementia (RR = 2.96). Drug poisoning patterns were also connected to psychiatric diagnoses, including mood disorders (RR = 7.24) and combinations of alcohol and mood disorders (RR = 18.25). Across these patterns, confidence intervals were narrow and relative width values were low. The generated patterns and statistical measures are publicly available in a web interface for research purposes: https://cs.uef.fi/ml/impro/disease-pattern/
Conclusions:
Frequent pattern mining, when integrated with RR/CI and precision filtering, produces clinically interpretable temporal connections that could help in decision-making and hypothesis development. External validation with other datasets and cohorts is essential.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.