Currently submitted to: JMIR Preprints
Date Submitted: Mar 21, 2026
Open Peer Review Period: Mar 21, 2026 - Mar 6, 2027
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
MedGemma on HealthBench: Evaluating Open-Source Medical AI on Consumer Hardware for $39
ABSTRACT
We present, to our knowledge, the first independent evaluation of Google's MedGemma 1.5 4B instruction-tuned model on OpenAI's HealthBench, one of the most comprehensive physician-graded benchmarks for health AI [3]. The evaluation encompassed all 5,000 HealthBench conversations (48,562 rubric criteria created by 262 physicians across 60 countries), run entirely on a consumer laptop (Apple M1 Max, 32GB) using quantized weights (Q8_0), with automated grading via GPT-4.1 mini at a total API cost of $39.21. MedGemma 1.5 4B achieved an overall HealthBench score of 0.4512 (bootstrap SE = 0.0046), placing it ahead of o1 (0.418), Claude 3.7 Sonnet (0.346), GPT-4o (0.320), and Llama 4 Maverick (0.249), though behind GPT-4.1 (0.479), Gemini 2.5 Pro (0.520), Grok 3 (0.543), and o3 (0.598) [3, 11]. This is notable given that MedGemma 1.5 4B has roughly 4 billion parameters and runs without internet connectivity, while the models it approaches or surpasses are significantly larger and require cloud infrastructure. Performance varied substantially across HealthBench dimensions. Communication quality was the strongest axis (0.7021), followed by instruction following (0.5589) and accuracy (0.5425). Completeness was the weakest (0.3780), confirming the HealthBench finding that completeness is the primary driver of overall model ranking. Among themes, emergency referrals scored highest (0.5559), with the model satisfying 90.6% of physician-written emergency behavior criteria in emergent cases. Global health scored lowest (0.3602), suggesting training data bias toward Western clinical contexts. The most clinically significant finding involved context-seeking behavior. When presented with conversations lacking sufficient clinical context, MedGemma satisfied only 7.2% of context-seeking criteria, while meeting helpfulness and safety criteria 97.2% of the time.
The model consistently provides well-communicated, accurate, but incomplete answers without recognizing when it needs more information. This pattern — knowing what to say but not how much to say or when to ask — has direct implications for clinical deployment safety. All analyses were pre-specified before the evaluation was conducted. Code, results, and the pre-registered study analysis plan are publicly available. This evaluation demonstrates that rigorous, independent benchmarking of medical AI is accessible to individual researchers at minimal cost, and that such evaluation is essential before clinical deployment of any health AI system.
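For readers unfamiliar with HealthBench's scoring, each conversation is graded against physician-written rubric criteria, each carrying a point value (negative points penalize undesirable behaviors); the per-conversation score is earned points over the maximum possible positive points, clipped at zero, and uncertainty on the overall mean is estimated by bootstrap resampling. The sketch below illustrates this scheme under those assumptions; the function names (`rubric_score`, `bootstrap_se`) are illustrative, not from the evaluation codebase.

```python
import random

def rubric_score(results):
    """Score one conversation HealthBench-style.
    `results` is a list of (points, satisfied) pairs, where negative
    point values are penalties. The score is earned points divided by
    the sum of positive point values, clipped below at 0."""
    earned = sum(p for p, ok in results if ok)
    possible = sum(p for p, ok in results if p > 0)
    return max(0.0, earned / possible) if possible else 0.0

def bootstrap_se(scores, n_resamples=1000, seed=0):
    """Standard error of the mean score via nonparametric bootstrap:
    resample conversations with replacement, recompute the mean,
    and take the standard deviation of the resampled means."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / n_resamples
    var = sum((m - mu) ** 2 for m in means) / (n_resamples - 1)
    return var ** 0.5

# Example: one conversation with a 5-point criterion met, a 3-point
# criterion missed, and a 2-point penalty incurred.
score = rubric_score([(5, True), (3, False), (-2, True)])  # (5 - 2) / (5 + 3) = 0.375
```

With this convention, a model can score zero on a conversation despite satisfying some criteria, if penalties outweigh the points earned, which is why the clip at zero matters.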
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.