Assessment of the Validity of ChatGPT-3.5 Responses to Patient-Generated Queries Following BPH Surgery - Beyond the Abstract

Artificial intelligence (AI) tools, including large language models (LLMs), are rapidly finding their way into everyday medicine. It is no longer uncommon for physicians to be questioned by patients about information obtained from such tools, most notably ChatGPT. However, many patients do not verify this information with their physician, which is especially concerning among patients recovering from surgical procedures. As urologists, we were prompted to ask: “How reliable is ChatGPT-3.5 in answering questions posed by patients during their postoperative recovery after benign prostatic hyperplasia (BPH) surgery?” Of particular interest was the evaluation of potential hazards to patient safety.

To answer the question, we carefully selected realistic queries concerning immediate postoperative recovery after BPH surgery from forums, social media platforms, and instruction booklets posted online. Two reviewers selected the 216 most important questions, divided them into main categories including pain, bleeding, catheter care, urination, and alarming signs/emergencies, and applied the questions across all included surgical procedures, ranging from TURP to Aquablation. These reviewers then graded ChatGPT’s responses, resolving disputes with a third, more senior reviewer, and assigned a category to each incorrect answer.

While the results were encouraging, they were far from perfect. Based on our strict grading scheme, 78% of the answers were accurate and comprehensive enough to be deemed clinically acceptable, while only 2.3% were completely inaccurate. Some of these posed hazards to patient safety, such as failing to advise immediate contact with the surgeon or a visit to the emergency room in cases of massive bleeding or urinary retention. The rest of the answers (19.5%) contained small errors that do not necessarily affect the validity of the answer but may confuse the patient and lead to mistrust of the chatbot. Importantly, ChatGPT was clear, empathetic, and considerate when answering all questions.

Interestingly, questions about newer surgical procedures had a higher percentage of correct answers, likely because the data posted online about these procedures is less likely to be confusing to the AI model than the data about older procedures. This raises important questions: “How can we trust a model that can be so easily confused at a time when the data available online is changing so rapidly? And what about the temporal reliability of the responses?”

This leaves us with an important takeaway: while ChatGPT is far from being capable of replacing physicians or healthcare providers in providing postoperative instructions, it can serve as a relatively safe and reliable adjunct. Perhaps, for now, physicians should know how to educate patients about available AI chatbots, including ChatGPT, and teach them how to use them safely. The next step would be to fine-tune available AI chatbots, including ChatGPT, to become more medically oriented, safer, and more accurate.

Written by: Jad Najdi, MD, and Albert El-Hajj, MD

Department of Surgery, Division of Urology, American University of Beirut Medical Center, Beirut, Lebanon.
