Accepted for/Published in: Online Journal of Public Health Informatics
Date Submitted: Nov 22, 2024
Date Accepted: Nov 11, 2025
Application of Machine Learning to Auto-code Injury Data: Design and Preliminary Implementation
ABSTRACT
Background:
The Canadian Hospitals Injury Reporting and Prevention Program (CHIRPP), a Public Health Agency of Canada (PHAC) program established in 1990, is an injury and poisoning sentinel surveillance system that collects and analyzes data on injuries to individuals seen at the emergency departments of numerous pediatric and general hospitals in Canada. Since its inception, the program has collected over 4 million records. The program’s surveillance activities have contributed substantially to evidence-based decision-making to reduce injuries, support research and establish preventive safeguards to protect the health and safety of Canadians. Patients presenting at participating hospitals are asked to complete a data collection form capturing the causes and circumstances contributing to the injury or poisoning event. Using this text, hospital and program staff have traditionally coded numerous surveillance variables manually for subsequent analysis within e-CHIRPP, the program’s purpose-built analytical application on the Canadian Network for Public Health Intelligence (CNPHI) public health informatics platform. Manual coding of this complex data is administratively burdensome and results in a significant time lag in the availability of important surveillance findings.
Objective:
The objective was to apply machine learning to auto-code injury data based on patient narratives, thereby improving the timeliness of surveillance findings within a process that supports adaptability and continuous improvement. The initial goal was to reach a preliminary stage of implementation.
Methods:
The research, development and implementation of machine learning and auto-coding within the e-CHIRPP system was led by the CNPHI team in collaboration with the CHIRPP program team. Data was extracted from e-CHIRPP and prepared for training, and candidate algorithms well suited to classification and supervised learning were initially assessed. One algorithm was then chosen for further assessment based on initial accuracy, prediction confidence and training time. The chosen algorithm was further assessed in two stages, again using e-CHIRPP extracts: first on a two-year data set and then on a seven-year data set. The sources of inaccuracies were investigated with a view to informing the refinement of the overall process and establishing ongoing adaptability and continuous improvement.
Results:
Auto-coding of injury variables showed a high level of accuracy in most cases when compared to variables previously coded manually. Importantly, insights were also gained into the sources of observed inaccuracies and measures to foster ongoing refinement of the process.
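As an illustration only, the comparison of auto-coded variables against manually coded ones might be sketched as below. This is a minimal sketch under stated assumptions, not the method used in e-CHIRPP: the function name `agreement_by_variable`, the variable names `location` and `body_part`, and the toy records are all hypothetical. It computes per-variable agreement with the manual codes and collects the disagreements, which is the kind of list a reviewer could use when investigating sources of inaccuracy.

```python
from collections import defaultdict


def agreement_by_variable(manual, auto):
    """Compare auto-assigned codes with manual codes, record by record.

    `manual` and `auto` are lists of dicts (one dict per record) mapping a
    variable name to its code. The two lists are assumed to be aligned by
    index. Returns (accuracy per variable, list of mismatches).
    """
    if len(manual) != len(auto):
        raise ValueError("manual and auto must have the same number of records")

    correct = defaultdict(int)
    total = defaultdict(int)
    mismatches = []

    for idx, (manual_rec, auto_rec) in enumerate(zip(manual, auto)):
        for var, manual_code in manual_rec.items():
            total[var] += 1
            auto_code = auto_rec.get(var)  # None if the variable was not auto-coded
            if auto_code == manual_code:
                correct[var] += 1
            else:
                # Keep enough detail to review the disagreement later.
                mismatches.append((idx, var, manual_code, auto_code))

    accuracy = {var: correct[var] / total[var] for var in total}
    return accuracy, mismatches


# Hypothetical toy records, invented for illustration (not CHIRPP data).
manual_codes = [
    {"location": "home", "body_part": "arm"},
    {"location": "school", "body_part": "head"},
    {"location": "home", "body_part": "leg"},
    {"location": "road", "body_part": "head"},
]
auto_codes = [
    {"location": "home", "body_part": "arm"},
    {"location": "school", "body_part": "head"},
    {"location": "park", "body_part": "leg"},
    {"location": "road", "body_part": "head"},
]

accuracy, mismatches = agreement_by_variable(manual_codes, auto_codes)
```

Reporting agreement per variable rather than as a single overall figure makes it easier to see which variables are coded reliably and which need attention.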
Conclusions:
The application of machine learning and auto-coding shows strong potential to benefit surveillance activities across various public health disciplines, yielding near real-time availability of intelligence, reduced administrative workload, continuous improvement and adaptability to database growth and change.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.