Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 13, 2020
Open Peer Review Period: Jan 13, 2020 - Jan 23, 2020
Date Accepted: Apr 10, 2020
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
AlphaBERT: An extractive summarization model based on a character-level token and Bidirectional Encoder Representations from Transformers (BERT)
ABSTRACT
Background:
Doctors must care for many patients simultaneously, and finding and reviewing every patient's medical history is time-consuming. Deep learning methods, such as models based on Bidirectional Encoder Representations from Transformers (BERT), are useful for summarization. To address the problem of medical terminology, the BioBERT model is also included in this study and its performance is compared. However, a heavy model is difficult to deploy on the outdated, resource-limited computers used across several hospitals. Adopting character-level tokens in BERT is one solution to this problem.
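As a rough illustration of why character-level tokens shrink the model (a minimal sketch only; the actual AlphaBERT vocabulary and special tokens are not specified in this abstract), a character vocabulary needs on the order of 100 embedding rows, versus roughly 30,000 for a standard WordPiece vocabulary:

```python
# Minimal sketch of character-level tokenization for a BERT-style model.
# Assumption: printable-ASCII characters plus BERT-style special tokens;
# the real AlphaBERT vocabulary may differ.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
CHARS = [chr(c) for c in range(32, 127)]  # printable ASCII
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CHARS)}

def encode(text):
    """Map each character to a token id; unknown characters become [UNK]."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in text]
    ids.append(VOCAB["[SEP]"])
    return ids

print(len(VOCAB))               # 100 rows in the embedding table
print(encode("CHF, s/p CABG"))  # every character, including punctuation, is a token
```

Because the input embedding table accounts for a large share of BERT's parameters (roughly 30,000 × hidden-size weights), replacing it with a ~100-entry character vocabulary removes most of that block.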
Objective:
We aim to build an extractive summarization model for diagnoses in hospital information systems and to provide a website service that can operate with limited computing resources.
Methods:
We collected diagnoses from the National Taiwan University Hospital Integrated Medical Database (NTUH-iMD) and used extractive summaries highlighted by experienced doctors as labels. We used a BERT-based structure with a two-stage training method: we adopted character-level tokens to reduce the model size, pretrained the model on randomly masked characters in the diagnoses and ICD sets, and then fine-tuned it with the summary labels. We cleaned up the prediction results by averaging the probabilities over each whole word so that the character-level tokens would not produce fragmented words. We evaluated model performance with the ROUGE score and built a questionnaire website to collect feedback from more doctors on each summary proposal.
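The word-level cleanup could look like the following sketch (hypothetical function and variable names; it assumes one extraction probability per character and whitespace-delimited words):

```python
import re

def smooth_to_words(text, char_probs):
    """Average per-character probabilities over each whitespace-delimited
    word so that a word is selected or dropped as a unit."""
    assert len(text) == len(char_probs)
    smoothed = list(char_probs)
    for m in re.finditer(r"\S+", text):          # span of each word
        s, e = m.span()
        mean_p = sum(char_probs[s:e]) / (e - s)
        smoothed[s:e] = [mean_p] * (e - s)
    return smoothed

text = "acute MI"
probs = [0.9, 0.2, 0.8, 0.7, 0.3, 0.0, 0.6, 0.9]
print(smooth_to_words(text, probs))
# "acute" collapses to 0.58 for all five characters and "MI" to 0.75,
# so thresholding can no longer split a word in the middle.
```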
Results:
The areas under the receiver operating characteristic curve (AUROCs) of the summary proposals were 0.941, 0.928, 0.899, and 0.933 for BioBERT, BERT, LSTM, and the proposed model, respectively. The corresponding ROUGE-L scores were 0.697, 0.711, 0.648, and 0.678. The mean (standard deviation) critic scores from doctors were 2.232 (0.832), 2.134 (0.877), 2.207 (0.844), 1.927 (0.910), and 2.126 (0.874) for the reference-by-doctor labels, BioBERT, BERT, LSTM, and the proposed model, respectively. In pairwise paired t-tests, LSTM differed significantly from the reference (p<.001), BERT (p=.001), BioBERT (p<.001), and the proposed model (p=.002); no other pairwise differences were significant.
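The pairwise paired t-tests could be reproduced along these lines (a sketch only: the score arrays below are synthetic stand-ins drawn to match the reported means and an approximate pooled standard deviation, not the study data; assumes SciPy):

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic critic scores per summary; in the study, each diagnosis was
# rated for the reference and for every model, which is what pairs the samples.
means = {"reference": 2.232, "BioBERT": 2.134, "BERT": 2.207,
         "LSTM": 1.927, "proposed": 2.126}
scores = {name: rng.normal(mu, 0.87, size=200) for name, mu in means.items()}

for a, b in combinations(scores, 2):  # every pair of rated systems
    t, p = stats.ttest_rel(scores[a], scores[b])
    print(f"{a:9s} vs {b:9s}: t={t:6.2f}, p={p:.3f}")
```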
Conclusions:
Using character-level tokens in a BERT model can greatly decrease the model size without significantly reducing performance on the diagnosis summarization task. A well-developed deep learning model can enhance doctors' abilities and advance medical research by making extensive unstructured free-text notes usable. Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.