JMIR Preprints #12310: Word embedding for French natural language in healthcare: a comparative study

Word embedding for French natural language in healthcare: a comparative study

Emeric Dynomant;
Romain Lelong;
Badisse Dahamna;
Clément Massonaud;
Gaétan Kerdelhué;
Julien Grosjean;
Stéphane Canu;
Stefan J Darmoni

ABSTRACT

Background:

Word embedding technologies are now used in a wide range of applications. However, no formal evaluation or comparison has been made of the models produced by the three best-known implementations (Word2Vec, GloVe, and FastText).

Objective:

The goal of this study is to compare embedding implementations on a corpus of documents produced by health professionals in their working context.

Methods:

Models were trained on documents from the Rouen university hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, prescriptions, etc). Four evaluation tasks were defined (cosine similarity, odd one, mathematical operations, and human formal evaluation) and applied to each model.
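As an illustration only, the three automatic tasks named above (cosine similarity, odd one, and mathematical operations) can be sketched on toy data. The vectors, the vocabulary, and the function names below are hypothetical and are not taken from the study's models or test sets; the abstract does not describe the exact scoring procedures, so this is one common way of computing each task, not the authors' implementation.

```python
import numpy as np

# Hypothetical 4-dimensional toy vectors (NOT from the study's models).
# Assumed dimensions: [organ, specialist, cardiac, renal].
TOY_VECTORS = {
    "heart":        np.array([2.0, 0.0, 1.0, 0.0]),
    "kidney":       np.array([2.0, 0.0, 0.0, 1.0]),
    "liver":        np.array([2.0, 0.0, 0.0, 0.0]),
    "cardiologist": np.array([0.0, 2.0, 1.0, 0.0]),
    "nephrologist": np.array([0.0, 2.0, 0.0, 1.0]),
    "surgeon":      np.array([0.0, 2.0, 0.0, 0.0]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centroid = unit.mean(axis=0)
    scores = unit @ centroid  # dot product of each word with the mean vector
    return words[int(np.argmin(scores))]


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' with vector arithmetic: b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


if __name__ == "__main__":
    print(cosine_similarity(TOY_VECTORS["heart"], TOY_VECTORS["kidney"]))
    print(odd_one_out(["heart", "kidney", "cardiologist"], TOY_VECTORS))
    print(analogy("heart", "cardiologist", "kidney", TOY_VECTORS))
```

With these toy vectors, the two organs are closer to each other than to a specialist, the specialist is flagged as the odd one among two organs, and the analogy "heart is to cardiologist as kidney is to ?" resolves to "nephrologist". Real embedding models would supply the vectors in place of `TOY_VECTORS`.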

Results:

Word2Vec obtained the highest score on three of the four tasks (mathematical operations, odd one similarity, and human validation), particularly with the Skip-Gram architecture.

Conclusions:

Although this implementation obtained the best scores, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the preservation of morphosyntactic similarity observed with FastText. The models and test sets produced by this study will be the first to be made publicly available through a graphical interface to help advance French biomedical research.

Citation

Please cite as:

Dynomant E, Lelong R, Dahamna B, Massonaud C, Kerdelhué G, Grosjean J, Canu S, Darmoni SJ

Word Embedding for the French Natural Language in Health Care: Comparative Study

JMIR Med Inform 2019;7(3):e12310

DOI: 10.2196/12310

PMID: 31359873

PMCID: 6690161

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Sep 25, 2018

Open Peer Review Period: Sep 30, 2018 - Nov 11, 2018

Date Accepted: Apr 21, 2019

Per the author's request the PDF is not available.