Performance of Large Language Models in Guiding Bladder Cancer Management - Expert Commentary
The investigators developed 100 clinical questions based on established guidelines, covering the epidemiology, diagnosis, treatment, prognosis, and follow-up of bladder cancer (BLCA). They tested six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) in three independent trials and validated the models’ responses against current clinical guidelines and expert consensus. The team then applied a two-phase training optimization process specifically to GPT-3.5-Turbo to improve its performance.
In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4.0 (85.67% ± 1.15%) and Grok-beta (84.33% ± 1.53%). Gemini-1.5-Pro and Mistral-Large-2 performed similarly (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo had the lowest initial accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy on BLCA-related queries.
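For readers less familiar with the "mean ± standard deviation" notation used above, the following is a minimal sketch of how such a summary could be computed from three trials of 100 questions each. The function name `accuracy_summary` and the per-trial counts are hypothetical and are not taken from the study. The sketch assumes the sample standard deviation; the commentary does not state which estimator the investigators used.

```python
from statistics import mean, stdev


def accuracy_summary(correct_per_trial, n_questions=100):
    """Return (mean, sample SD) of per-trial accuracy, in percent.

    Hypothetical helper for illustration; not the investigators' code.
    """
    # Convert each trial's count of correct answers into a percentage.
    accuracies = [100.0 * c / n_questions for c in correct_per_trial]
    # stdev() is the sample standard deviation (n - 1 denominator).
    return mean(accuracies), stdev(accuracies)


# Illustrative counts of correct answers in three trials (NOT study data).
example_counts = [80, 82, 84]
avg, sd = accuracy_summary(example_counts)
print(f"{avg:.2f}% ± {sd:.2f}%")  # prints "82.00% ± 2.00%"
```

With only three trials, the standard deviation is a rough indicator of run-to-run variability rather than a precise estimate, which is worth keeping in mind when comparing models whose reported ranges overlap.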
This interesting study compares the performance of several LLMs in handling BLCA-related clinical questions and demonstrates the potential for significant improvement through targeted training optimization. As the investigators note, more studies of human-AI interaction in clinical management are needed to optimize synergy between clinicians and AI and to improve care for patients with bladder cancer.
Written by: Bishoy M. Faltas, MD, Director of Bladder Cancer Research, Englander Institute for Precision Medicine, Weill Cornell Medicine