Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Sep 25, 2018
Open Peer Review Period: Sep 30, 2018 - Nov 11, 2018
Date Accepted: Apr 21, 2019

The final, peer-reviewed published version of this preprint can be found here:

Dynomant E, Lelong R, Dahamna B, Massonnaud C, Kerdelhué G, Grosjean J, Canu S, Darmoni SJ

Word Embedding for the French Natural Language in Health Care: Comparative Study

JMIR Med Inform 2019;7(3):e12310

DOI: 10.2196/12310

PMID: 31359873

PMCID: 6690161

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Word Embedding for the French Natural Language in Health Care: Comparative Study

  • Emeric Dynomant; 
  • Romain Lelong; 
  • Badisse Dahamna; 
  • Clément Massonnaud; 
  • Gaétan Kerdelhué; 
  • Julien Grosjean; 
  • Stéphane Canu; 
  • Stefan J Darmoni

Background:

Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation or comparison has been made of how well the three most popular unsupervised implementations (Word2Vec, GloVe, and FastText) preserve the semantic similarities between words when trained on the same dataset.

Objective:

The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best-performing method will then be used to develop a new semantic annotator.

Methods:

Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one out, analogy-based operations, and formal human evaluation) and applied to each model, along with embedding visualization.
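The three automated tasks above can be sketched in a few lines. This is an illustrative example, not the authors' evaluation code: the words and 3-dimensional vectors below are made up stand-ins for a trained embedding (real models typically use 100 to 300 dimensions), but the scoring logic — cosine similarity, odd one out by lowest average similarity, and analogy by vector arithmetic — follows the standard definitions of these tasks.

```python
import math

# Hypothetical toy "embeddings"; in the study these would come from
# Word2Vec, GloVe, or FastText models trained on the clinical corpus.
vectors = {
    "fever":     [0.9, 0.1, 0.0],
    "cough":     [0.8, 0.2, 0.1],
    "aspirin":   [0.1, 0.9, 0.0],
    "ibuprofen": [0.2, 0.8, 0.1],
    "hospital":  [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Task 1: cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(words):
    """Task 2: the word least similar, on average, to the others."""
    return min(words, key=lambda w: sum(cosine(vectors[w], vectors[o])
                                        for o in words if o != w))

def analogy(a, b, c):
    """Task 3: solve a : b :: c : ? via vector arithmetic (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = set(vectors) - {a, b, c}
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(round(cosine(vectors["fever"], vectors["cough"]), 3))  # 0.984
print(odd_one_out(["fever", "cough", "hospital"]))           # hospital
print(analogy("fever", "cough", "aspirin"))                  # ibuprofen
```

In the study each task was rated per model, so the same test set can be run against the vocabulary of every trained embedding and the scores compared directly.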

Results:

Word2Vec had the highest score on 3 of the 4 rated tasks (analogy-based operations, odd one out, and human validation), particularly with the skip-gram architecture.

Conclusions:

Although this implementation best preserved semantic properties, each model has its own strengths and weaknesses, such as the very short training time of GloVe or the conservation of morphological similarity observed with FastText. The models and test sets produced by this study will be the first to be made publicly available through a graphical interface, to help advance French biomedical research.



Per the author's request the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.