Performance of Large Language Models in Guiding Bladder Cancer Management - Expert Commentary

Bladder cancer (BLCA) management presents complex clinical challenges requiring deep knowledge of current guidelines and clinical expertise. A recent study by Li et al. evaluated the performance of several large language models (LLMs) in addressing clinical management questions related to bladder cancer.

The investigators developed 100 clinical questions based on established guidelines, covering the epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. They tested six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) in three independent trials. The models’ responses were validated against current clinical guidelines and expert consensus. The team also implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance.

In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy. Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest initial accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy in BLCA-related queries.
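
As a rough illustration of how a figure such as "89.33% ± 1.53%" can be derived from three trials on a 100-question set, the following Python sketch computes the mean accuracy and its standard deviation. The per-trial counts are hypothetical and are not taken from the study, the function name is illustrative, and the choice of the sample standard deviation is an assumption, since the commentary does not state which form Li et al. reported.

```python
from statistics import mean, stdev


def summarize_accuracy(correct_counts, n_questions=100):
    """Return (mean, sample SD) of accuracy, in percent, across trials.

    correct_counts: number of correctly answered questions in each trial.
    n_questions: size of the question set (100 in the study described above).
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-trial correct counts, NOT data from the study.
trials = [80, 82, 84]
avg, sd = summarize_accuracy(trials)
print(f"{avg:.2f}% ± {sd:.2f}%")  # prints "82.00% ± 2.00%"
```

With only three trials, the standard deviation is sensitive to a difference of a single question, so small gaps between models with overlapping ranges should be read cautiously.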

This interesting study describes the comparative performance of various LLMs in handling BLCA-related clinical questions and validates the potential for significant improvement through targeted training optimization. As the investigators note, more studies on human-AI interaction in clinical management are needed to optimize synergism and clinical care for patients with bladder cancer.

Written by: Bishoy M. Faltas, MD, Director of Bladder Cancer Research, Englander Institute for Precision Medicine, Weill Cornell Medicine

References:

  1. Li KP, Wang L, Wan S, Wang CY, Chen SY, Liu SH, Yang L. Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models. J Endourol. 2025;00(00):000-000. DOI: 10.1089/end.2024.0860