Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 27, 2022
Date Accepted: Apr 14, 2022
Readability of English, German, and Russian Disease-related Wikipedia pages: Automated Computational Analysis
ABSTRACT
Background:
Wikipedia is a popular encyclopedia for health- and disease-related information, in which patients seek advice and guidance online. However, Wikipedia articles can be written by anyone. Moreover, an article’s readability is not assessed prior to publishing, which can negatively impact patients who are in need of disease- or health-related information or even discourage them from seeking further material. Therefore, Wikipedia articles can be unsuited as patient education material, as investigated by previous studies that analyzed specific diseases or medical topics with a comparatively small sample size. As of today, no data are available on average readability levels of all disease-related Wikipedia pages for different localizations of this particular encyclopedia.
Objective:
This study aimed to analyze disease-related Wikipedia pages written in English, German and Russian using well-established readability metrics for each language.
Methods:
Wikipedia database snapshots and Wikidata metadata were chosen as resources for data collection. Disease-related articles were retrieved separately for English, German, and Russian, starting with the main concept Human Diseases and Disorders (German: Krankheit, Russian: Заболевания человека). Where available, corresponding ICD-10 codes were retrieved for each article. Next, raw texts were extracted and readability metrics were computed. To test whether text difficulty differs between groups, two-sided t-tests were conducted for different ICD-10 chapters and pairwise for articles in different languages.
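As an illustration of the readability step, the following is a minimal sketch of the Flesch Reading Ease (FRE) score for English text, using the standard formula FRE = 206.835 - 1.015 × (words/sentences) - 84.6 × (syllables/words). It is not the tooling used in this study: the vowel-group syllable counter, the regex-based sentence splitter, and the function names are simplifying assumptions, and German and Russian texts require language-specific variants of the formula.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English syllable estimate (assumption): count vowel groups."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard English FRE from naive sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def is_difficult(score: float, threshold: float = 50.0) -> bool:
    """Scores below 50 are treated as difficult or very difficult to read."""
    return score < threshold


if __name__ == "__main__":
    sample = "The cat sat on the mat."
    score = flesch_reading_ease(sample)
    print(f"FRE={score:.2f}, difficult={is_difficult(score)}")
```

Higher scores indicate easier text, so short sentences with short words score well above the threshold of 50, whereas dense medical prose typically falls below it.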
Results:
The numbers of articles included in this study for the English (EN), German (DE), and Russian (RU) Wikipedia were n_EN=6,127, n_DE=6,024, and n_RU=3,314, respectively. Most disease-related articles have an FRE score lower than 50.00, indicating difficult or very difficult educational material (EN: 96.93% (5,937/6,125); DE: 99.70% (6,004/6,022); RU: 79.90% (2,647/3,313)). Seven out of ten analyzed articles could be assigned an ICD-10 code with certainty (EN: 69.12% (4,235/6,127); DE: 76.78% (4,625/6,024); RU: 69.89% (2,316/3,314)). For articles with ICD-10 codes, the mean FRE scores were: FRE_EN=28.69, FRE_DE=20.33, FRE_RU=38.54. Nine English ICD-10 chapters (DE: eleven; RU: ten) showed significant differences: F (FRE=23.88; P<.001), E (FRE=25.14; P<.001), H (FRE=30.04; P=.049), I (FRE=30.05; P=.04), M (FRE=31.17; P<.001), T (FRE=32.06; P=.001), A (FRE=32.63; P<.001), B (FRE=33.24; P<.001), S (FRE=39.02; P<.001).
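The reported shares follow directly from the counts given in parentheses. A short sketch that recomputes them is shown below; the dictionary layout and the helper name `share` are illustrative assumptions, while the counts are taken from the text above.

```python
def share(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100 * part / total, 2)


# (articles with FRE < 50, articles with an FRE score) per language
fre_below_50 = {"EN": (5937, 6125), "DE": (6004, 6022), "RU": (2647, 3313)}

# (articles with an ICD-10 code, all included articles) per language
with_icd10 = {"EN": (4235, 6127), "DE": (4625, 6024), "RU": (2316, 3314)}

if __name__ == "__main__":
    for lang in ("EN", "DE", "RU"):
        print(
            lang,
            f"FRE<50: {share(*fre_below_50[lang]):.2f}%",
            f"ICD-10: {share(*with_icd10[lang]):.2f}%",
        )
```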
Conclusions:
Disease-related English, German, and Russian Wikipedia articles cannot be recommended as patient education material, as a major fraction is difficult or very difficult to read. Authors of Wikipedia pages should carefully revise existing text materials for readers with a specific interest in a disease or associated symptoms. Special attention should be given to articles on ‘Mental, Behavioral and Neurodevelopmental disorders’ (ICD-10 chapter F), as these articles were the most difficult to read (FKGL=15.33, ARI=13.87, CLI=15.49, SMOG=17.22, Gunning FOG=16.92) in comparison to other ICD-10 chapters. Readers should be supported by providing a short and easy-to-read summary for each article.