
Currently accepted at: JMIR Formative Research

Date Submitted: Nov 9, 2025
Open Peer Review Period: Nov 10, 2025 - Jan 5, 2026
Date Accepted: Mar 5, 2026

This paper has been accepted and is currently in production.

It will appear shortly at DOI 10.2196/87465.

The final accepted version (not yet copyedited) is available below.

Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted", should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Differences in Safety Risks across Languages for Health Large Language Models: A Cross-Language Vulnerability Study

  • Saubhagya Joshi; 
  • Melissa Mendoza; 
  • Yonaira Rivera; 
  • Vivek K. Singh

ABSTRACT

Background:

Large language models (LLMs) such as ChatGPT are increasingly used to support health-related queries and decision-making. However, these models can be “jailbroken” through adversarial prompts that bypass safety filters and elicit harmful or medically inappropriate responses. In healthcare contexts, such vulnerabilities pose serious risks. Understanding how jailbreak susceptibility varies across languages is essential for developing robust safeguards and promoting equitable access to safe health information.

Objective:

This study aims to systematically compare the vulnerability of a health LLM to jailbreaking across three languages, English, Spanish, and Hindi (transliterated into the Latin alphabet), using emoji and permutation cipher attacks.

Methods:

We analyzed 1,000 input prompts per language, drawn from the BeaverTails dataset, across three harm categories: self-harm, violence, and drug abuse. Each prompt was modified using emoji and permutation cipher techniques, resulting in 6,000 input-output pairs. Model responses were evaluated by human coders to determine the success rate of jailbreak attempts across languages and cipher types.
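To make the two perturbation techniques concrete, the sketch below shows one plausible construction of each. The paper does not specify its exact cipher designs, so the block-permutation scheme, the `key` parameter, and the word-to-emoji mapping are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: the exact cipher constructions used in the
# study are not specified here, so both functions are assumptions.

def permutation_cipher(text: str, key: tuple[int, ...]) -> str:
    """Reorder characters within fixed-size blocks according to `key`.

    `key` is a permutation of range(len(key)); each block of characters
    is rewritten in the order the key dictates, scrambling surface
    spelling while keeping all characters recoverable.
    """
    block = len(key)
    # Pad with spaces so the text length is a multiple of the block size.
    padded = text + " " * (-len(text) % block)
    out = []
    for i in range(0, len(padded), block):
        chunk = padded[i:i + block]
        out.append("".join(chunk[k] for k in key))
    return "".join(out)


def emoji_cipher(text: str, mapping: dict[str, str]) -> str:
    """Substitute selected words with emoji to obscure flagged terms."""
    return " ".join(mapping.get(word.lower(), word) for word in text.split())


# Example: perturb a (benign placeholder) prompt both ways.
prompt = "describe the process"
scrambled = permutation_cipher(prompt, key=(2, 0, 3, 1))
masked = emoji_cipher("describe the pill", {"pill": "\U0001F48A"})
```

A model that refuses the plain-text prompt may still answer the scrambled or emoji-masked variant, which is the jailbreak behavior the human coders scored.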

Results:

Hindi prompts showed the highest vulnerability, with 787 of 1,000 prompts yielding successful jailbreaks using emoji ciphers and 873 using permutation ciphers. Spanish and English followed, with lower success rates across both cipher types. Differences in jailbreak success across languages and cipher strategies were statistically significant. Additionally, attacks targeting violence-related prompts were more successful overall than those targeting drug-related or self-harm content, indicating variation in vulnerability by harm type.

Conclusions:

The findings of this formative study reveal that LLM safety performance varies substantially across languages and harm categories, raising concerns about equitable protection in multilingual health communication. Disparities in access to harmful content may contribute to downstream health risks. Strengthening multilingual content moderation and developing language-aware safety mechanisms are critical steps toward safer and more inclusive health AI systems.


 Citation

Please cite as:

Joshi S, Mendoza M, Rivera Y, Singh VK

Differences in Safety Risks across Languages for Health Large Language Models: A Cross-Language Vulnerability Study

JMIR Preprints. 09/11/2025:87465

DOI: 10.2196/preprints.87465

URL: https://preprints.jmir.org/preprint/87465




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.