Currently submitted to: JMIR Preprints
Date Submitted: Mar 21, 2026
Open Peer Review Period: Mar 21, 2026 - Mar 6, 2027
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
MedGemma on HealthBench: Evaluating Open-Source Medical AI on Consumer Hardware for $39
ABSTRACT
We present, to our knowledge, the first independent evaluation of Google's MedGemma 1.5 4B instruction-tuned model on OpenAI's HealthBench, one of the most comprehensive physician-graded benchmarks for health AI [3]. The evaluation encompassed all 5,000 HealthBench conversations (48,562 rubric criteria created by 262 physicians across 60 countries), run entirely on a consumer laptop (Apple M1 Max, 32GB) using quantized weights (Q8_0), with automated grading via GPT-4.1 mini at a total API cost of $39.21. MedGemma 1.5 4B achieved an overall HealthBench score of 0.4512 (bootstrap SE = 0.0046), placing it ahead of o1 (0.418), Claude 3.7 Sonnet (0.346), GPT-4o (0.320), and Llama 4 Maverick (0.249), though behind GPT-4.1 (0.479), Gemini 2.5 Pro (0.520), Grok 3 (0.543), and o3 (0.598) [3, 11]. This is notable given that MedGemma 1.5 4B has roughly 4 billion parameters and runs without internet connectivity, while the models it approaches or surpasses are significantly larger and require cloud infrastructure. Performance varied substantially across HealthBench dimensions. Communication quality was the strongest axis (0.7021), followed by instruction following (0.5589) and accuracy (0.5425). Completeness was the weakest (0.3780), confirming the HealthBench finding that completeness is the primary driver of overall model ranking. Among themes, emergency referrals scored highest (0.5559), with the model satisfying 90.6% of physician-written emergency behavior criteria in emergent cases. Global health scored lowest (0.3602), suggesting training data bias toward Western clinical contexts. The most clinically significant finding involved context-seeking behavior. When presented with conversations lacking sufficient clinical context, MedGemma satisfied only 7.2% of context-seeking criteria, while meeting helpfulness and safety criteria 97.2% of the time.
The model consistently provides well-communicated, accurate, but incomplete answers without recognizing when it needs more information. This pattern — knowing what to say but not how much to say or when to ask — has direct implications for clinical deployment safety. All analyses were pre-specified before the evaluation was conducted. Code, results, and the pre-registered study analysis plan are publicly available. This evaluation demonstrates that rigorous, independent benchmarking of medical AI is accessible to individual researchers at minimal cost, and that such evaluation is essential before clinical deployment of any health AI system.
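For readers unfamiliar with HealthBench's scoring, each conversation is graded against physician-written rubric criteria, each carrying a point value (negative points penalize undesirable behaviors); the per-conversation score is earned points over the maximum possible positive points, clipped at zero, and uncertainty on the overall mean is estimated by bootstrap resampling. The sketch below illustrates this scheme under those assumptions; the function names (`rubric_score`, `bootstrap_se`) are illustrative, not from the evaluation codebase.

```python
import random

def rubric_score(results):
    """Score one conversation HealthBench-style.
    `results` is a list of (points, satisfied) pairs, where negative
    point values are penalties. The score is earned points divided by
    the sum of positive point values, clipped below at 0."""
    earned = sum(p for p, ok in results if ok)
    possible = sum(p for p, ok in results if p > 0)
    return max(0.0, earned / possible) if possible else 0.0

def bootstrap_se(scores, n_resamples=1000, seed=0):
    """Standard error of the mean score via nonparametric bootstrap:
    resample conversations with replacement, recompute the mean,
    and take the standard deviation of the resampled means."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / n_resamples
    var = sum((m - mu) ** 2 for m in means) / (n_resamples - 1)
    return var ** 0.5

# Example: one conversation with a 5-point criterion met, a 3-point
# criterion missed, and a 2-point penalty incurred.
score = rubric_score([(5, True), (3, False), (-2, True)])  # (5 - 2) / (5 + 3) = 0.375
```

With this convention, a model can score zero on a conversation despite satisfying some criteria, if penalties outweigh the points earned, which is why the clip at zero matters.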
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.