Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 30, 2025
Date Accepted: Feb 18, 2026
Date Submitted to PubMed: Feb 20, 2026

The final, peer-reviewed published version of this preprint can be found here:

Understanding User Intent in Code-Mixed Sexual and Reproductive Health Queries in Urban India: Hierarchical Classification Approach Using Large Language Models

Dey SK, S M, Thapa A, Shah M, Mehta Z, Kapile SK, Divate T, Jalota S, Ismail A

J Med Internet Res 2026;28:e86545

DOI: 10.2196/86545

PMID: 41875054

Understanding User Intent in Code-Mixed Sexual and Reproductive Health Queries in Urban India: A Hierarchical Classification Approach Using LLMs

  • Sumon Kanti Dey; 
  • Manvi S; 
  • Aradhana Thapa; 
  • Meet Shah; 
  • Zeel Mehta; 
  • Shraddha Kale Kapile; 
  • Tanvi Divate; 
  • Suhani Jalota; 
  • Azra Ismail

ABSTRACT

Background:

Access to knowledge about sexual and reproductive health (SRH) remains limited by stigma and taboo in many parts of the world. In the Global South, information delivery is further complicated by linguistic and cultural diversity. For instance, in India (our study context), urban Hindi-speaking users frequently type text in Hinglish (code-mixed Hindi and English written in the Latin script) and use colloquial language to describe SRH concerns. Large language models (LLMs) could help answer SRH questions, but most systems are trained for English and struggle with code-mixed text and cultural context. Our research addresses this gap by examining how well current LLMs understand user intent in SRH queries expressed in a low-resource, code-mixed language.

Objective:

This study evaluates the effectiveness of proprietary, multilingual open-weight, and Indic LLMs in zero-shot settings for identifying user intent in code-mixed Hinglish SRH queries. Our aim is to measure how well LLMs assign the correct label in a two-level hierarchical classification (topic then subtopic). We take a hierarchical approach because SRH queries are complex and context-dependent; flat labels can obscure clinically important distinctions and lead to misdirected guidance. We also characterize common error types that drive misclassification.

Methods:

We analyzed 4,161 de-identified questions about SRH in Hinglish (code-mixed Hindi and English written in the Latin script), collected by our partner nonprofit health organization (Myna Mahila Foundation) in an underserved community in urban Mumbai. Queries were annotated into 8 topics and 40 subtopics using a hierarchical framework that captured linguistic, cultural, and contextual variation. We compared the performance of proprietary, multilingual open-weight, and Indic-specific LLMs in zero-shot settings. Performance was measured using hierarchical F1 (hF1), exact match, and topic- and subtopic-level accuracy.
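For readers unfamiliar with the metric, hierarchical F1 is commonly computed over ancestor-augmented label sets, so that a prediction landing in the right topic but the wrong subtopic still earns partial credit. A minimal sketch for a two-level topic-then-subtopic taxonomy follows; the label names and taxonomy here are illustrative, not drawn from the study's annotation framework:

```python
# Minimal sketch of hierarchical precision/recall/F1 (hF1) for a
# two-level taxonomy. Labels below are illustrative placeholders.

def augment(label, parent_of):
    """Return the label plus all of its ancestors in the taxonomy."""
    out = {label}
    while label in parent_of:
        label = parent_of[label]
        out.add(label)
    return out

def hierarchical_f1(pairs, parent_of):
    """pairs: list of (predicted_subtopic, true_subtopic) tuples."""
    inter = pred_total = true_total = 0
    for pred, true in pairs:
        p, t = augment(pred, parent_of), augment(true, parent_of)
        inter += len(p & t)          # shared labels, counting ancestors
        pred_total += len(p)
        true_total += len(t)
    hp, hr = inter / pred_total, inter / true_total
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0

# Illustrative taxonomy: subtopic -> parent topic
parent_of = {"period_pain": "menstruation",
             "cycle_length": "menstruation",
             "condom_use": "contraception"}

# One exact match, one sibling confusion (right topic, wrong subtopic):
pairs = [("period_pain", "period_pain"), ("cycle_length", "period_pain")]
print(round(hierarchical_f1(pairs, parent_of), 3))  # → 0.75
```

Note that the sibling confusion still scores 0.5 at the hierarchical level (the shared topic ancestor matches), which is exactly the distinction a flat label set would lose.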

Results:

Proprietary models achieved the strongest results, with GPT-5 performing best overall (hF1 = 0.784). Among open-weight systems, Sarvam-M emerged as the top-performing Indic model (hF1 = 0.757), ranking just below proprietary models and even surpassing Claude-3.5-Sonnet (0.745) as well as large multilingual systems such as LLaMA-3.3-70B-Instruct (0.742) and Gemma-3-27B-IT (0.739). Other Indic models performed considerably lower (e.g., LLaMA3-Gaja-Hindi-8B, 0.596; Krutrim-2-Instruct, 0.558; Airavata, 0.404). Smaller multilingual open-weight models, including Mixtral-8x7B-Instruct (0.593), LLaMA-3.1-8B-Instruct (0.630), and Gemma-2-9B-IT (0.657), consistently outperformed them, showing that parameter size alone does not explain performance gaps. While models generally captured broad topical intent, they frequently failed at fine-grained intent recognition, especially with euphemisms, colloquial expressions, and locally and culturally situated questions.

Conclusions:

Hierarchical classification revealed persistent gaps in how LLMs handle code-mixed queries. Proprietary models performed best, but Sarvam-M shows that open-weight Indic systems can achieve near–state-of-the-art performance when supported by robust training data and cultural adaptation. Strengthening such localized fine-tuned models is essential for developing culturally informed, linguistically inclusive AI tools that can expand equitable access to SRH information in underserved populations globally.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.