JMIR Preprints #25530: Similarity-based Unsupervised Spelling Correction Using BioWordVec for Bacteria Culture Reports


Similarity-based Unsupervised Spelling Correction Using BioWordVec for Bacteria Culture Reports

Tae Hyeong Kim;
Min Ji Kang;
Se Ha Lee;
Jong-Ho Kim;
Hyung Joon Joo;
Sung Won Han;
Jang Wook Sohn

ABSTRACT

Background:

Existing bacterial culture test results for infectious diseases are recorded as unrefined free text, which introduces many problems, including typographical errors and stop words. Effective spelling correction is needed to ensure the accuracy and reliability of data for infectious disease research, including medical terminology extraction. When a dictionary is available, spelling correction algorithms based on edit distance are efficient. Without a dictionary, however, traditional algorithms that rely only on edit distance have limitations.

Objective:

In this research, we propose a similarity-based spelling correction algorithm that uses pre-trained word embeddings from BioWordVec. Instead of the existing rule-based approach, the method uses distributed representations based on character-level n-grams, learned without supervision. In other words, we propose a framework that detects and corrects typographical errors when no dictionary is available.
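To illustrate why character-level n-grams help with misspellings, the following minimal sketch extracts n-grams with word-boundary markers, in the style of subword embedding models. The function names, the n-gram range, and the example words are assumptions made for illustration, and the Jaccard overlap is only a stand-in for similarity between learned vectors; it is not the authors' implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, with '<' and '>'
    added as word-boundary markers (hypothetical helper)."""
    token = f"<{word}>"
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams common to both sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative words only: a misspelling, its intended form, an unrelated term.
typo = char_ngrams("eschericia")
intended = char_ngrams("escherichia")
unrelated = char_ngrams("klebsiella")

print(jaccard(typo, intended))   # substantial overlap
print(jaccard(typo, unrelated))  # no shared n-grams
```

Because a misspelled word keeps most of the n-grams of its intended form, a model that builds word vectors from n-grams can place the two close together even when the misspelling never appeared in a dictionary.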

Methods:

For detected typographical errors not mapped to SNOMED Clinical Terms, a group of correction candidates with high similarity, also taking edit distance into account, was generated using pre-trained word embeddings from the clinical database. A grid search over the embedding matrix, in which the vocabulary is arranged in descending order of frequency, was used to find candidate groups of similar words. The correction candidates were then ranked by word frequency, and each typo was corrected according to this ranking.
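The steps described in the Methods can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' code: trigram count vectors stand in for the pre-trained BioWordVec embeddings, the thresholds `min_sim` and `max_dist` are hypothetical values (in practice they might be tuned, for example by a grid search), and the toy vocabulary is invented.

```python
import math
from collections import Counter


def ngram_vector(word, n=3):
    """Stand-in 'embedding': counts of character trigrams (assumption)."""
    token = f"<{word}>"
    return Counter(token[i:i + n] for i in range(len(token) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def correct(typo, vocab_by_freq, min_sim=0.5, max_dist=2):
    """Pick the most frequent vocabulary word that is both similar to
    `typo` and within `max_dist` edits; return `typo` if none qualifies.
    `vocab_by_freq` must be sorted from most to least frequent."""
    query = ngram_vector(typo)
    for word in vocab_by_freq:  # earlier position means higher frequency
        if (cosine(query, ngram_vector(word)) >= min_sim
                and edit_distance(typo, word) <= max_dist):
            return word
    return typo


# Toy vocabulary, most frequent first; the last entry is itself a rare typo.
vocab = ["escherichia", "coli", "klebsiella", "eschericha"]
print(correct("eschericia", vocab))
print(correct("xyz", vocab))
```

In this toy run, both "escherichia" and the rare misspelling "eschericha" pass the similarity and edit distance filters, and the frequency ordering is what selects the former. A word with no sufficiently similar candidate is left unchanged.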

Results:

Bacteria identification words were extracted from 27,544 bacteria culture reports, and 914 spelling errors of 16 types were found. The similarity-based spelling correction algorithm using BioWordVec proposed in this research corrected 12 types of typographical errors and 99.45% of all spelling errors.

Conclusions:

Working from the bacterial identification words in bacteria culture reports, this tool corrected spelling errors effectively in the absence of a dictionary. The method will help build high-quality, refined databases from the vast amount of text data in electronic health records.

Citation

Please cite as:

Kim TH, Kang MJ, Lee SH, Kim JH, Joo HJ, Han SW, Sohn JW

Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports

JMIR Med Inform 2021;9(2):e25530

DOI: 10.2196/25530

PMID: 33616536

PMCID: 7939936


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.


Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 7, 2020

Date Accepted: Jan 20, 2021
