JMIR Preprints #12596: Identifying clinical terms in free-text notes using ontology-guided machine learning

Identifying clinical terms in free-text notes using ontology-guided machine learning

Aryan Arbabi; David R Adams; Sanja Fidler; Michael Brudno

ABSTRACT

Background:

Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. Mining such terms is complicated by the broad use of synonyms and non-standard terms in medical documents.

Objective:

We present a machine learning model for concept recognition in large volumes of unstructured text that optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.

Methods:

We present a neural dictionary model that can be used to predict whether a phrase is synonymous with a concept in a reference ontology. Our model uses a convolutional neural network and the taxonomy structure to encode an input phrase into an embedding space, then ranks medical concepts by their similarity to the phrase in that space. It also uses the biomedical ontology structure to optimize the embeddings of terms, and it has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT. Our code is available (open source) at https://github.com/ccmbioinfo/NeuralCR.
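
To make the encode-then-rank idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation (which is in the repository above). Everything in it is an assumption made for illustration: the toy ontology and its concept names, the random untrained weights, the single convolution layer with max pooling, and the choice to build each concept embedding by summing the raw vectors of the concept and all of its ancestors as one possible way to use the taxonomy.

```python
import math
import random

DIM = 8      # embedding size (arbitrary for this toy example)
WIDTH = 2    # convolution window, in words
rng = random.Random(0)

# Hypothetical toy taxonomy: concept -> list of parents (not real HPO terms).
PARENTS = {
    "root": [],
    "abnormal_heart": ["root"],
    "arrhythmia": ["abnormal_heart"],
    "abnormal_eye": ["root"],
}

def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Random stand-ins for parameters that a real model would learn.
WORD_VECS = {}
FILTERS = [rand_vec(WIDTH * DIM) for _ in range(DIM)]
RAW = {c: rand_vec(DIM) for c in PARENTS}

def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen

def concept_vec(concept):
    """Assumed scheme: sum raw vectors over the concept and its ancestors."""
    members = sorted(ancestors(concept))
    return [sum(RAW[a][k] for a in members) for k in range(DIM)]

def encode_phrase(phrase):
    """1-D convolution over word vectors, ReLU, max-over-time pooling."""
    words = phrase.lower().split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    vecs = [WORD_VECS.setdefault(w, rand_vec(DIM)) for w in words]
    vecs += [[0.0] * DIM] * (WIDTH - 1)          # zero-pad the right edge
    windows = [sum(vecs[i:i + WIDTH], []) for i in range(len(words))]
    pooled = [max(max(0.0, dot(f, w)) for w in windows) for f in FILTERS]
    norm = math.sqrt(dot(pooled, pooled)) or 1.0
    return [x / norm for x in pooled]

def rank_concepts(phrase):
    """Score every concept by dot product, softmax, sort best first."""
    enc = encode_phrase(phrase)
    scores = {c: dot(enc, concept_vec(c)) for c in PARENTS}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return sorted(((c, e / total) for c, e in exps.items()),
                  key=lambda pair: -pair[1])

for concept, prob in rank_concepts("irregular heart beat"):
    print(f"{concept:15s} {prob:.3f}")
```

Because the weights here are random, the printed ranking carries no meaning; the sketch only shows the data flow from phrase to ranked concepts. Under the assumed ancestor-sum scheme, a child concept shares part of its embedding with its parents, which is one way related concepts could end up close together in the embedding space.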

Results:

We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of using the taxonomy structure of the ontology in concept recognition.

Conclusions:

The application of machine learning methods to the identification of clinical terms in unstructured free text has been hampered by the lack of training data and the difficulty of identifying novel synonyms for terms in the ontology. Our work uses machine learning approaches that allow for synonym identification and the use of orthogonal, unlabelled biomedical corpora. Without any custom training, our model performs as well as or better than state-of-the-art models custom-built for specific ontologies.