Artificial Intelligence versus Classical Scoring Systems: A Comparative Analysis of Stone-Free Prediction after Percutaneous Nephrolithotomy - Beyond the Abstract
Traditional nephrolithometric tools such as Guy’s Stone Score (GSS), the S.T.O.N.E. nephrolithometry score, the CROES nomogram, and the S-ReSC score have long been used to estimate the complexity and probable success of percutaneous nephrolithotomy (PNL). At the same time, artificial intelligence, particularly large language model (LLM)-based tools, has generated growing interest as a possible new generation of decision support in endourology. In our study, we sought to directly compare these two worlds: established stone scoring systems and a ChatGPT-based urinary stone-free prediction tool. We retrospectively analyzed 340 patients who underwent PNL between 2019 and 2025. For each patient, we calculated GSS, CROES, S.T.O.N.E., and S-ReSC, and we also generated an individualized prediction using a ChatGPT-based Urinary Stone-Free Predictor. These scores and predictions were then compared with actual postoperative outcomes using correlation analysis, logistic regression, and receiver operating characteristic (ROC) analysis. The overall stone-free rate in our cohort was 60.9%.
Patients with residual stones had significantly higher GSS, S.T.O.N.E., and S-ReSC values, while CROES scores and the ChatGPT-based predicted probabilities did not differ significantly between the stone-free and residual-stone groups. On multivariable analysis, only GSS and S.T.O.N.E. remained independent predictors of stone-free status. ROC analysis also showed that GSS, S.T.O.N.E., and S-ReSC had significant discriminatory value, whereas CROES and the ChatGPT-based model did not. Most notably, the ChatGPT-based tool yielded an area under the ROC curve (AUC) of 0.481, indicating performance no better than chance in this clinical setting. These findings are clinically relevant for several reasons.
First, they suggest that in a real-world, relatively complex referral population, conventional nephrolithometric systems still provide more dependable guidance than a general-purpose LLM-based prediction tool. Second, they highlight an important distinction that is often blurred in current discussions about AI in medicine: not all AI tools are equivalent. A language model that can synthesize information fluently is not necessarily a robust predictive model for surgical outcomes. Prediction tasks in surgery require transparent model development, structured datasets, calibrated weighting of variables, and rigorous external validation. When these elements are absent, even a sophisticated AI interface may produce outputs that appear plausible but lack reliable clinical discrimination.
Our results should therefore not be interpreted as a rejection of artificial intelligence in stone surgery. On the contrary, they support a more precise and responsible path forward. The problem is not AI itself, but the premature use of non-transparent, insufficiently validated, general-purpose systems in high-stakes decision-making. In fact, prior research suggests that disease-specific machine learning models trained on structured clinical data can outperform classical scoring systems in selected settings. The future likely belongs not to generic LLM outputs alone, but to dedicated, transparent, and externally validated AI models built specifically for endourological outcome prediction.
Another important message from our study is that even the best-performing conventional scores demonstrated only modest discrimination. This reflects the intrinsic complexity of PNL outcomes, which are influenced by multiple interacting factors, including stone burden, distribution, anatomy, density, and procedural variables. In other words, stone-free prediction remains a difficult clinical task, and there is still considerable room for innovation.
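To make the meaning of an AUC near 0.5 concrete, the following minimal Python sketch computes the AUC as the proportion of (stone-free, residual-stone) patient pairs in which the stone-free patient received the higher predicted probability, with ties counted as half. The function name `pairwise_auc` and all numbers are hypothetical illustrations; they are not study data and this is not the analysis code used in the study.

```python
from itertools import product


def pairwise_auc(scores_stone_free, scores_residual):
    """Return the AUC via its pairwise (Mann-Whitney) interpretation.

    The value is the fraction of (stone-free, residual) pairs in which the
    stone-free patient has the higher predicted stone-free probability.
    Tied pairs contribute 0.5.
    """
    if not scores_stone_free or not scores_residual:
        raise ValueError("both groups must contain at least one patient")

    wins = 0.0
    for s, r in product(scores_stone_free, scores_residual):
        if s > r:
            wins += 1.0
        elif s == r:
            wins += 0.5
    return wins / (len(scores_stone_free) * len(scores_residual))


# Hypothetical predictions that separate the two groups perfectly.
perfect = pairwise_auc([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2])
print(perfect)  # 1.0

# Hypothetical predictions that carry no information about the outcome:
# both groups receive the same distribution of predicted probabilities.
chance = pairwise_auc([0.6, 0.4], [0.6, 0.4])
print(chance)  # 0.5
```

Under this reading, a value close to 0.5 means the tool ranks a stone-free patient above a residual-stone patient about as often as a coin flip would. For scores where higher values indicate greater complexity, the same calculation applies with the residual-stone group treated as the positive class.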
The challenge for future model development is not merely to replace older tools with newer technology, but to produce models that are truly more accurate, reproducible, and generalizable across different centers and patient populations.
In summary, our study supports the continued practical value of classical nephrolithometric scoring systems, especially GSS and S.T.O.N.E., for predicting stone-free outcomes after PNL. At the same time, it cautions against overinterpreting the capabilities of current general-purpose LLM-based tools in surgical prediction. Before such tools can be integrated into routine endourological decision-making, they must undergo disease-specific training, calibration, and multicenter external validation. Until then, conventional scoring systems remain the more reliable option for everyday clinical use.
Written by: Burak Elmaağaç, MD, Assistant Professor, Department of Urology, University of Health Sciences, Kayseri Faculty of Medicine, Kayseri City Hospital, Türkiye