Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 29, 2025
Date Accepted: Nov 18, 2025
Date Submitted to PubMed: Nov 18, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Multitier Tagging of Chinese Online Health Education Resources Using a Large Language Model: Development and Validation Study

Meng J, Dai R, Huang X, Gu Y, Yan S, Wang X, Gao J, Zhang T

J Med Internet Res 2025;27:e83219

DOI: 10.2196/83219

PMID: 41251541

PMCID: 12756663

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Automated Multi-Tier Tagging of Chinese Online Health Education Materials Using a Large Language Model: Development and Validation Study

  • Jialin Meng
  • Ruiming Dai
  • Xiaolan Huang
  • Yi Gu
  • Shixing Yan
  • Xiaoke Wang
  • Jingrong Gao
  • Tiantian Zhang

ABSTRACT

Background:

Effective precision health education and promotion depend on the efficient dissemination of health information. However, current health communication faces structural bottlenecks: information overload combined with poor matching of content to audiences, variable quality of health resources, and a lack of personalized services. These challenges impede large-scale targeted distribution and audience access. This study aimed to develop and validate an automated tagging system using a large language model (LLM) to enhance the efficiency and equity of health communication and promotion.

Objective:

This study aimed to develop, deploy, and validate an artificial intelligence-driven, multi-tier, automated content tagging system to address the core challenges in managing Chinese health education resources and provide a technical foundation for scalable precision health communication.

Methods:

We developed a health promotion taxonomy with 10 primary, 34 secondary, and 90,562 tertiary tags using a hybrid method that combined a top-down approach (aligned with national standards and expert knowledge) with a bottom-up approach (corpus mining). We then built an automated tagging system for health promotion materials by fine-tuning a Baichuan2-7B LLM with Low-Rank Adaptation (LoRA) and integrating it with a named entity recognition model and a vector database (Chroma DB), and we evaluated the system's performance.
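To make the three-tier structure concrete, the following is a minimal sketch of how a candidate tag might be resolved to a primary/secondary/tertiary path by nearest-neighbor search over tag embeddings. It is not the authors' implementation: the tag names, the 3-dimensional vectors, the `TagPath` class, the `nearest_tag` function, and the `min_similarity` threshold are all hypothetical, and a plain in-memory list stands in for the vector database named in the abstract.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TagPath:
    """One leaf of a three-tier taxonomy (hypothetical structure)."""
    primary: str
    secondary: str
    tertiary: str


# Hypothetical toy entries with made-up 3-d embeddings; a real system
# would store model-generated embeddings in a vector database.
TAXONOMY = [
    (TagPath("Chronic disease", "Diabetes", "Blood glucose monitoring"), (0.9, 0.1, 0.0)),
    (TagPath("Chronic disease", "Hypertension", "Salt reduction"), (0.1, 0.9, 0.0)),
    (TagPath("Maternal and child health", "Infant care", "Breastfeeding"), (0.0, 0.1, 0.9)),
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tag(query_vec, min_similarity=0.5):
    """Return (TagPath, score) for the closest tag, or (None, score)
    when the best match falls below the assumed similarity threshold."""
    best_path, best_vec = max(TAXONOMY, key=lambda item: cosine(query_vec, item[1]))
    score = cosine(query_vec, best_vec)
    return (best_path, score) if score >= min_similarity else (None, score)


path, score = nearest_tag((0.8, 0.2, 0.0))
print(path, round(score, 3))
```

Returning `None` below a threshold is one possible way to route low-confidence items to manual review; whether the described system does this is not stated in the abstract.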

Results:

The final taxonomy included all 16 national priority health domains. The model achieved an overall tag automation rate of 94.8% on the test set, with rates of 97.38% for text-only resources and 89.55% for nontext resources. In a comparative analysis, the model-generated tags demonstrated a higher thematic relevance to the source content than the original manual annotations.
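As a rough consistency check on the three reported rates, the sketch below assumes that the overall automation rate is a resource-count-weighted average of the text-only and nontext rates. The abstract does not state this, so the result is an inference under that assumption rather than a reported figure, and the function name `implied_text_share` is hypothetical.

```python
def implied_text_share(overall, text_rate, nontext_rate):
    """Solve overall = w * text_rate + (1 - w) * nontext_rate for w.

    Assumes the overall rate is a count-weighted average of the two
    subgroup rates (an assumption, not stated in the source).
    """
    if text_rate == nontext_rate:
        raise ValueError("subgroup rates are equal; share is undetermined")
    return (overall - nontext_rate) / (text_rate - nontext_rate)


# Rates as reported in the abstract, in percent.
share = implied_text_share(94.8, 97.38, 89.55)
print(f"implied share of text-only resources: {share:.2f}")
```

Under that assumption, roughly two-thirds of the test-set resources would be text-only. A share outside the range 0 to 1 would have indicated that the three figures cannot come from a simple weighted average.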

Conclusions:

A fine-tuned LLM can efficiently automate the assignment of a granular multilevel tagging system for Chinese health promotion resources. This approach provides a scalable solution to a key bottleneck in health-information management, establishing a technical foundation for advancing precise health communication and improving equitable access to health information.


 Citation

Please cite as:

Meng J, Dai R, Huang X, Gu Y, Yan S, Wang X, Gao J, Zhang T

Automated Multitier Tagging of Chinese Online Health Education Resources Using a Large Language Model: Development and Validation Study

J Med Internet Res 2025;27:e83219

DOI: 10.2196/83219

PMID: 41251541

PMCID: 12756663


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.