Matthew Cooperberg: Good morning. I'm Matt Cooperberg from University of California in San Francisco. It's a pleasure to bring you another installment of our UroToday Center of Excellence and Localized Prostate Cancer interview series. I am thrilled to be joined today by Anindo Saha, who is a PhD candidate in diagnostic imaging, medical imaging in Nijmegen. This is the name you may have seen on a number of papers in the burgeoning space of AI in MRI. And he is going to share with us some really exciting developments around this PI-CAI initiative, which is a really, I would say, phenomenal open-source initiative to try to bring standardization and higher quality interpretation of biparametric MRI into community practice around the world. And this is in, some would say, exactly the direction we should be heading in AI. So with that, Anindo, thanks for joining us. And why don't you give us a bit of background about the study and how PI-CAI got started?
Anindo Saha: Yep. Thanks for having me, Matt. So yep, I'm Anindo, and thanks for the introduction. So PI-CAI really began around 2021. We noticed that AI research in prostate MRI was really fragmented. Different people used different reference standards, different cohort definitions, really small sample sizes, and everything was quite behind compared to AI in many other subspecialties of medicine. So PI-CAI really started in an effort to unify that and get everyone on board and do it together essentially. So it's a study that was done under the oversight of a multidisciplinary advisory board. That's the first thing we started with. We wanted input from radiologists, urologists, international, not just restricted to our academic center and the way we do things, but globally, what's the practice and where is it going? And yeah, that really kicked off the study and then it went on from 2022 to 2024.
So the setting of PI-CAI, of course, is prostate cancer detection on MRI. As we know, there are one million or more new diagnoses every year worldwide, and over 300,000 deaths due to prostate cancer. And yeah, over the past five years or a little more than that, pre-biopsy MRI has become the international standard of care, recommended by all major clinical societies. However, when it comes to radiologic care, the standard that's used to interpret prostate MRI scans, PI-RADS, currently in version 2.1, is susceptible to overdiagnosis and a lot of inter-reader variability. You need subspecialist experience, which is not available equitably everywhere. And of course, there's also a rising demand in imaging as populations around the world are aging. So in many other subspecialties of medicine, we've noticed that AI can assist, but for prostate cancer detection on MR, we haven't really seen wide-scale adoption, and there's not enough evidence for that. So yeah, that's where PI-CAI really started. So if we look at the study design from a global view, firstly, we wanted to curate a large multicenter, multi-vendor cohort.
And we did this across four tertiary care centers based in the Netherlands, but also one based in Norway used for external testing. We followed the standard convention as you would for studies like this. So yeah, patients did not have a prior diagnosis of clinically significant cancer. In the context of this study, of course, that's very important to define, which is Gleason grade group two or greater cancer. And we had to exclude MRI examinations with missing outcomes or diagnostically insufficient imaging quality where the radiologist said, "Yeah, we need a repeat exam," for instance. Yeah. This is also in line with the limitations of retrospective studies, of course, but the main thing was to curate a large dataset for this. So just as context, the largest datasets being used in the prostate AI literature as of 2021 were around 2,000 cases. And for AI, or really deep learning-based AI, to shine, you need really large cohorts. So this was the first of its kind with this scale of data, 10,000 cases or so, from around 9,000 patients. And that was the starting point. So then we kind of partitioned it. So part of that data was used to develop an AI system, where we invited prostate AI researchers from around the world.
As baselines, we made available the best AI system that we could come up with at our institute, but then let that be the starting point for the rest of the world to take it further. And we used the testing dataset to then organize a multi-reader, multi-case observer study with radiologists. In total, we had around 62 radiologists who participated from 45 centers in 20 countries. And finally, once the AI system was developed and trained, we did a head-to-head testing between the radiologists and the AI system on these cases. Of course, we also looked at how the AI systems performed in comparison to the standard-of-care readings that were made for these cases historically. So our primary outcomes are twofold. Firstly, when we look at the AI versus the radiologist participating in the reader study, on average, the AI demonstrated statistically superior diagnostic performance. And so it's worth noting that we originally designed this study for non-inferiority.
That was our primary hypothesis, and if that passed, we would then test for superiority in a hierarchical fashion. And we see that AI did pass superiority as well. So notably, what this means is that it generates 50% fewer false positives at the same cancer detection rate as the PI-RADS 3 or greater operating point, or you could look at it as, at the same false-positive rate, it gets around 6% more cancers. So the second part of the primary outcome was the performance of the AI system in comparison to the historical reads that were made in routine practice. There are a few things worth noting about why this performance now seems to be kind of close. Radiologists participating in the reader study only had access to exactly the same information as the AI system, plus DCE scans.
What I mean by that is they did not have access to patient history, for instance, and they did not have access to peer consultation. And these are things you would typically also have access to during routine practice, MDT meetings, et cetera. So the historical reads kind of account for those. And with that, performance goes up a bit, but AI, still just based on biparametric MRI, PSA, prostate volume, and patient age, manages to perform just the same. We did not pass non-inferiority, but this was due to half a percentage point in specificity, which really is quite literally one false positive more in the context of this study. So make of that what you will. But yeah, those are the primary outcomes of the PI-CAI study. So to kind of summarize, on the clinical side, we see that AI on average did statistically better than the 62 radiologists participating in the reader study. You could look at it in two ways: it catches more cancers at the same false-positive rate, or it gives way fewer false positives at the same cancer detection rate.
And it was also very comparable to the standard-of-care readings that were done for these cases. On the open science part, as you pointed out, and this is something we really push for in this space, we made a public dataset available. This is multicenter, multi-vendor cases. It's also the largest public dataset for prostate cancer detection on MR, especially developed for AI development that's now available. We've also made all our source code available for data prep, for training these AI models, the trained models, also for statistical testing exactly the way we did it, as well as the benchmark on the grand-challenge.org platform that anyone can upload their AI algorithms to and have it validated in a completely blinded way where they don't get to interact with the testing data directly. And in fact, as of now, so a year and a half since the PI-CAI study was published, there are three more models that came up that are even better than what we studied here.
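The "same cancer detection rate, fewer false positives" comparison described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the labels, AI scores, and reader calls are invented toy values (not PI-CAI data), and the function names are hypothetical rather than taken from the PI-CAI source code. The idea is simply to pick the AI threshold that matches the readers' sensitivity and then count false positives on both sides.

```python
# Toy data, invented for illustration only (not PI-CAI results).
# labels: 1 = grade group 2 or greater cancer, 0 = no such cancer.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ai_scores = [0.9, 0.8, 0.7, 0.2, 0.75, 0.3, 0.1, 0.1, 0.05, 0.05]
# reader_calls: 1 = reader scored the case PI-RADS 3 or greater.
reader_calls = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]


def counts_at(scores, labels, threshold):
    """Return (true positives, false positives) when score >= threshold is a positive call."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp, fp


def threshold_matching_sensitivity(scores, labels, target_sensitivity):
    """Return the highest threshold whose sensitivity is at least the target."""
    n_pos = sum(labels)
    for t in sorted(set(scores), reverse=True):
        tp, _ = counts_at(scores, labels, t)
        if tp / n_pos >= target_sensitivity:
            return t
    raise ValueError("target sensitivity is not reachable with these scores")


# Reader operating point: binary calls, so a threshold of 1 counts the positives.
reader_tp, reader_fp = counts_at(reader_calls, labels, 1)
reader_sensitivity = reader_tp / sum(labels)

# AI operating point matched to the reader's sensitivity.
ai_threshold = threshold_matching_sensitivity(ai_scores, labels, reader_sensitivity)
ai_tp, ai_fp = counts_at(ai_scores, labels, ai_threshold)

print(f"reader: sensitivity={reader_sensitivity:.2f}, false positives={reader_fp}")
print(f"AI at threshold {ai_threshold}: true positives={ai_tp}, false positives={ai_fp}")
```

The mirror-image view (same false-positive rate, more cancers found) would match on specificity instead of sensitivity; the published analysis also involves statistical testing that this sketch does not attempt.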
Matthew Cooperberg: Talk a little more about that, because you got hundreds of submissions initially, right? So talk a bit about the initial process of, because this was a truly global effort. There were dozens of countries involved in developing the models.
Anindo Saha: This is kind of a snapshot as of 10th of February 2024. You see that we had well over 1,500 people participating who opted in. There were 70 AI teams who were also registered. There were over 350 algorithms submitted from 50 countries as of then. And of course, this is a rolling thing. So in the past one and a half years, there have been more and more submissions. This also included a lot of people from the industry. So Siemens Healthineers participated, Philips registered, Avenda as well, Guerbet as well. In fact, Guerbet had the winning algorithm in the PI-CAI challenge. And yep, this is from Guerbet Research in France, as you can see there. But indeed the interesting thing for us was the top five systems really came from very different places all around the world. And that also kind of shows, yeah, we wanted to level the playing field that a lot of people might not have access to high-quality imaging, curated datasets like these, or benchmarks. We wanted to make that available so that the people who do have the AI expertise could all participate and we see how globally we're doing in this space.
Matthew Cooperberg: This is five countries, four continents in the top five, which I think is pretty awesome. The participation by industry, that's actually something I had not noted when I read the paper. How does this compare to the... there's a ton of commercial products which are coming into the space as well, to do very similar things. I'm sure there have not been a ton of head-to-head testing, but do you have any sense of how this compares to some of the other commercial products that are hitting the market?
Anindo Saha: So you kind of touched upon a pretty important point actually, Matt. And yeah, there's a slide kind of to highlight what I mean by that. So there's a lack of clear evidence when it comes to how these companies are performing. None of them really benchmark their products in an open-source dataset that's managed with an independent scientific advisory board, let's say. We have not a lot of evidence on each of these systems retrospectively as well. And we reached out to all of these companies at RSNA. We reached out to them at ECR. Most of them weren't willing. Some of them did participate. So Siemens, for instance, submitted their research prototype, and it did really well.
So it did, as far as you can see on the leaderboard, it did better than the radiologist as well. However, are these the same algorithms that are being then applied in standard of care in their commercial packages? Not really. These are all still prototypes. But what we're now pushing for is also through the PI-RADS committee is that these companies start to benchmark their models also on these independent performance registries. Siemens, I can disclose, is one of the first who's agreed to that. We're now benchmarking their product officially on PI-CAI and results should be out soon, but yeah, we would encourage more and more companies to participate and do this.
Matthew Cooperberg: Very exciting. There was a follow-up study recently from a few of your colleagues; Jasper Twilt, I think, was the first author. Looking at PI-CAI, it's an important clarification point that PI-CAI does not recapitulate the PI-RADS score, right? It gives a likelihood, a percentage likelihood of finding grade group two or higher. The second study, I think, had asked the question, could PI-CAI improve the radiologist's assignment of PI-RADS? And I think the bottom line was yes, it could, but it also did better by itself without the human input. So what does the future workflow look like here and what's the implication for radiology?
Anindo Saha: So there are many different ways you could have an AI-assisted workflow. The study that you referred to by Jasper, that was published in JAMA Network Open this year. We did see that AI performs better than radiologists. A large reason why that tends to happen also is because radiologists tend to be quite conservative. They will order extra biopsies even when they're a bit unsure. AI tends to be much more certain. So if it knows that there's cancer, it'll go for it. If it's negative, it won't ask for that extra biopsy. And there are many different ways we can investigate these workflows. So let's say the green one here is the prostate MRI exam. The red one is the radiologist.
The yellow one is the urologist. The blue one here is the AI system. And finally, this is the decision to biopsy or not. So you can look at it. This is a concurrent workflow, what we would call, and this is also the one that was investigated in Jasper's study, which is AI informs the radiologist. Both of them have access to the MR scan, but the radiologist is the one that ultimately then defers the decision to the urologist with the data at hand. So this is a concurrent reading setup. Of course, the other way we could also do this is we treat them as two separate independent readers. The radiologist has nothing to do with what the AI says. The AI has nothing to do with what the radiologist said. And it's ultimately up to the urologist to factor this information in and decide, do we need to go for a biopsy or not? We could also look at it in C. So in C, what we have is the AI is now the first reader.
So the AI, if it's very clearly confident and says, "Yeah, you know what? I think this is cancer." Will make a certain recommendation and maybe then you can have a low-vigilance review. Or the AI could say, "No, I'm not too sure if this is cancer or not." And then you have a high-vigilance review. And then some of the patients can directly, based on the certainty of those decisions, go home or be safely discharged, and some of them can then go back to the urologist to make a decision on whether to biopsy or not. This is kind of the safest approach that you could think of. There's always a human in the loop. But let's say we now start to trust the AI system that, yeah, we don't need so many people to look at these exams. Let's take one of them out, notably the ones where the AI system is very confident. Then it kind of looks like D. So we now take out that one human in the loop for low-vigilance reviews and the AI system can safely discharge a portion of the cases themselves.
This is where we start to really get a return on investment when it comes to a lot of these commercial products, because you are effectively getting a workload reduction here. A large portion of the cases are just handled autonomously. Finally, if we now trust the AI systems a lot more, the AI system could send a certain portion of cases directly to safe discharge, defer another portion directly to the urologist for consideration, and only when it's unsure do we then consult a radiologist to make the final call. This really saves on the workload, and radiologists can then spend their time dealing with the most complex cases and not the vast majority of easy yes-or-no calls in the healthiest cases. To summarize, that's what we would see.
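The last workflow described above, where the AI handles confident cases itself and consults a radiologist only when unsure, amounts to a three-way routing rule. Below is a minimal sketch of that idea. The cut-offs `rule_out` and `rule_in`, the function name `route_case`, and the example scores are all hypothetical; the interview does not specify thresholds, and any real cut-offs would need clinical validation.

```python
from collections import Counter


def route_case(ai_score, rule_out=0.05, rule_in=0.80):
    """Route one exam based on the AI's case-level likelihood of cancer.

    rule_out and rule_in are hypothetical cut-offs chosen for illustration.
    """
    if not 0.0 <= ai_score <= 1.0:
        raise ValueError("ai_score must be a likelihood between 0 and 1")
    if ai_score <= rule_out:
        return "discharge"      # AI is confident the exam is negative
    if ai_score >= rule_in:
        return "urologist"      # AI is confident; biopsy decision goes to the urologist
    return "radiologist"        # AI is unsure; a radiologist makes the call


# Invented example scores, for illustration only.
example_scores = [0.01, 0.03, 0.04, 0.10, 0.45, 0.60, 0.85, 0.92]
workload = Counter(route_case(s) for s in example_scores)
print(workload)
```

In this toy run only the exams routed to `"radiologist"` need a human read, which is where the workload reduction in such a setup would come from.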
Matthew Cooperberg: It's fascinating. Obviously the regulatory piece is going to unfold in, I'm sure, incredibly complicated ways in the various countries as this all becomes reality. Although we will have the issue, of course, of if the radiologist is only looking at the more complicated cases, we're going to have a lower and lower experience among the radiologists. We already have this PI-RADS variability. I think it'll be fascinating to see how expert readers versus non-expert readers do in this sort of adjudication kind of framework, but interesting space.
Anindo Saha: You also brought up a really interesting point. I think this was covered in an editorial piece that came out earlier this year in the AJR special series. I think someone from Mayo Clinic, Dr. Takahashi maybe, commented on it, which is if you look at the Jasper study, indeed AI is better than AI plus radiologists, which is better than just radiologists. He kind of argues that with sufficient training of radiologists to use AI systems effectively, they should perform as well as AI alone. That gap is really radiologists not being able to really trust the system and let it do its job where it does it really well. So perhaps we get there with sufficient education and training.
Matthew Cooperberg: Fascinating. One more question on this. Has anybody spent any time looking inside the black box? What is actually happening in the neural net models? Is it looking at diffusion weighting? Is it T2? Is it a combination? Are there things in the background? Is it all in the tumor? Is that part of scope?
Anindo Saha: Yep. So in PI-CAI, the models that we developed are meant to be very transparent. At the risk of being a little technical right now, they're detection models, which means that the output of these AI systems is really lesion detections. We're not just getting "this case has cancer" or "this case doesn't have cancer." It really shows you that, okay, this part of the image is where the cancer is, and this is the likelihood score associated with it. This is of course a really important part, especially for medical decision-making, where you need that level of interpretability. Now, if we go back into what's happening inside the box, why does it think this particular lesion here, let's say, has cancer? Yep.
We see that most of the weight is really placed on the diffusion-weighted scans. It really uses the T2-weighted imaging for certain cases, especially in the transition zone. This is also quite similar to how radiologists operate, but most of the weighting goes to the diffusion-weighted imaging. And what we also saw in internal experiments, not published, is that if you now also add contrast-enhanced imaging as an input for the AI system, it doesn't really care much about it. It is redundant information compared to the diffusion-weighted imaging.
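The detection-style output described above, a set of lesion locations each with a likelihood score, can be pictured with a small data structure. This is only a sketch: the `LesionDetection` type, its fields, and the example values are hypothetical, and reducing lesion scores to a single case-level likelihood by taking the maximum is an assumption made here for illustration, not something stated in the interview.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LesionDetection:
    """One candidate lesion reported by a detection model (hypothetical structure)."""
    voxel: Tuple[int, int, int]  # location of the detection in the image volume
    likelihood: float            # model's likelihood that this lesion is cancer


def case_level_likelihood(detections: List[LesionDetection]) -> float:
    """Summarize a case by its most suspicious lesion (assumed reduction: max).

    A case with no detections is given a likelihood of 0.0.
    """
    return max((d.likelihood for d in detections), default=0.0)


# Invented example: two candidate lesions in one exam.
exam = [
    LesionDetection(voxel=(112, 140, 9), likelihood=0.35),
    LesionDetection(voxel=(98, 121, 12), likelihood=0.82),
]
print(case_level_likelihood(exam))  # likelihood of the most suspicious lesion
print(case_level_likelihood([]))    # no detections in the exam
```

Keeping the per-lesion entries, rather than only the case-level number, is what lets a reader see where in the image the model is pointing.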
Matthew Cooperberg: Interesting. Yeah. Great. And last, if somebody actually wanted to start playing with this model to download it, to test it out in their system, where do they go? How does this work through grand-challenge.org?
Anindo Saha: Yeah. So they can just go to grand-challenge.org if they want to just test out a few cases that they want to try out so they can upload their cases there directly. It's fully anonymized and then you get a prediction out. But of course, if you want to play with it much more rigorously and just see how this model performs at your own clinical site, we're totally open to that. So please just reach out at my email handle or that of any of the PI-CAI organizers and we can set that up where you have basically this dockerized container, the algorithm to then deploy at your site or just investigate further for your patient cohort. It's what we are also doing with many participating centers right now where we're able to now go from the PI-CAI setup of, let's say, four tertiary care centers, to now an ongoing study that we have called SCARLET, where we include 46 sites across 46 cities in 22 countries. And effectively, all of these centers are eligible to just get a copy of that algorithm and play with it at their site. Yeah.
Matthew Cooperberg: Amazing. This is really hopefully going to be a game-changing initiative for imaging prostate cancer. Thanks so much for sharing your time with us, and this is going to be a very, very exciting space for us all to continue watching as the progress in PI-CAI unfolds. So thanks for all your work and thank you for your comments today.
Anindo Saha: Thanks for having me. Yep.