Shaping the Future of Effective Artificial Intelligence in Medicine

— Testing and rating the performance of AI models used in healthcare will be key

by Jeremy Faust, MD, MS, MA, Editor-in-Chief, 榴莲视频; Emily Hutto, Associate Video Producer
February 12, 2024

Jeremy Faust is editor-in-chief of 榴莲视频, an emergency medicine physician at Brigham and Women's Hospital in Boston, and a public health researcher. He is author of the Substack column Inside Medicine.
Emily Hutto is an Associate Video Producer & Editor for 榴莲视频. She is based in Manhattan.

To address the expanding presence of artificial intelligence (AI) in healthcare, researchers led by Nigam Shah, MBBS, PhD, of Stanford University in California, have proposed a public-private partnership to create a nationwide network of health AI assurance labs that would offer objective evaluation and ongoing assessment of AI models in healthcare.

In this exclusive interview, Jeremy Faust, MD, editor-in-chief of 榴莲视频, and Shah discuss the proposal, published in JAMA, and dive into the nitty-gritty of making AI accurate and secure enough to be used in medicine.

The following is a transcript of their remarks:

Faust: Hello, Jeremy Faust, medical editor-in-chief of 榴莲视频. Thanks for joining us.

Joining us today is Dr. Nigam Shah. Dr. Shah is a professor of medicine at Stanford University and chief data scientist for Stanford Healthcare. He is the author of a manuscript that appeared in JAMA in December, talking all about how we can make sure that we have great quality in AI as it comes forward.

Dr. Shah, thank you so much for joining us.

Shah: It's great to be here. Thanks for having me.

Faust: Let me describe a fantasy of the future. I walk into the patient's room, I talk to them, I do my exam, and I leave there thinking, "OK, it's viral pneumonia versus COVID versus bacterial pneumonia." By the time I get back to my computer, my phone has heard all of that. It brings up the chart and says, "OK, you seem to be thinking about viral versus bacterial pneumonia. Is this the workup you want?"

Depending on what it finds and whatever risk stratification score -- like CURB-65 -- you want to use, they could go home and I'll make their discharge instructions. It'll be in Spanish because that's what they speak.

Then I can go see the next patient, because it read my mind having heard the interaction and reading the chart.

Shah: We're not that far off. Actually, in 10 years, I hope we get there.

The vision you laid out, parts of it already exist. It is possible today to have an app listening to a doctor/patient conversation, and by the time they're done in 60 seconds produce a transcription that you can turn the screen around -- as our CMIO [chief medical information officer] Christopher Sharp likes to say -- and the physician and patient read it together, and it's accurate.

Separately, we have models that can take that input and produce really accurate differentials. And separately, we have models that, given a differential, can reliably recommend the workup given your institution's ordering patterns.

So those three pieces exist. They're not joined together yet, and I hope in 10 years we can glue them together. So it's not that far off, actually. You're pretty spot on.

Faust: In your article in JAMA, there's a focus on, in a way, taking that interaction, that scenario I just described, and saying, "Well, how do we know that what it churns out is actually any good and doesn't replicate some of the worst things that we've inherited in our system?" Is that what that article is really about, making sure that that future is a good one?

Shah: Or at least we can get to that future.

The assurance lab idea is motivated by the fact that right now, multiple entities can build models. Medical health systems can build them themselves, I'm sure vendors and EHR [electronic health record] providers can build them, students and researchers can build them, but we don't have a shared way of knowing whose model is any good.

I love driving and car analogies -- so imagine we didn't have the National Highway [Traffic] Safety Administration and we didn't have Consumer Reports and the people who test cars. It would be darn hard to find the car you need. And before you even get to that level of testing, we need to test the engine, we need to test the tires, we need to test the microchips that go into that.

And so if, in the scenario you painted, there were at least three or maybe 30 models that would be used in order to realize that vision, we need to be able to test individually each one of them as components. Then we need to be able to test the system.

In our view, the assurance lab is a step towards that direction so we can start testing the components, and most importantly, we can report the performance in a transparent way in a national registry.

Faust: For as long as we've been hearing about big data, we always hear "garbage in, garbage out," and it all depends on what is fed in. But with what you're describing, it seems like the single most important thing in driving this machine or this car is the determination of: what is the gold standard? Who decides who's right?

How do you do that? How do you know the difference between it replicating something that we don't like versus it actually revealing something that is true that we don't like?

Shah: Yeah. So great question. I'll answer it in two parts.

Part one is that in medicine we're attached to this notion of external validity -- as in, if there's a model that I use at my institution, it has to work exactly the same way, same performance at yours. I ask, "Why?"

I've written an article on this with Dr. Michael Pencina from Duke arguing that for a readmissions model, for example, something that's using operational inputs, what is our scientific basis by which we say the model that works in Palo Alto should work the exact same way in Mumbai and Beijing? The healthcare systems are different, the society works differently -- it makes no sense.

For physiological models, yes, we might have that insistence. And because we're borrowing from that mindset, the physiological mindset, we're kind of putting ourselves in this awkward corner that we want the model to perform locally, but we want to validate it externally and globally. It makes no sense.

So what we're saying is that we should come up with a regime for recurring local validation. So that's nice, it's a concept and an idea. How do we do that?

So if I give you a model, I need to be able to tell you that these are the eligibility criteria for the humans on whom you can apply the model. Those criteria can include data completion requirements saying, "I need you to ensure that there are at least 2 years of data before you even try using your model on your patients."

And then second, given your setup, you have to come up with your own criteria for what is acceptable performance. And it is not in terms of AUROC [area under the receiver operating characteristic curve] and AUPRC [area under the precision-recall curve], because the benefit we get from a model is tightly coupled with what action we choose to take in follow-up or what action we avoid taking in follow-up. So you have to analyze the construct of the model plus the workflow together. And because the workflow is very site-specific and because your data are site-specific, you have to do this evaluation locally.
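As an illustration of the local evaluation Shah describes, here is a minimal Python sketch. It is not the Stanford team's method. It assumes a hypothetical `Patient` record, applies the 2-year data-completeness criterion mentioned above, and scores the model with net benefit from decision curve analysis, a published metric swapped in here because it ties performance to the follow-up action through a decision threshold. The field names, the threshold, and the sample cohort are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    years_of_history: float  # length of prior record available at this site
    predicted_risk: float    # model output in [0, 1]
    had_outcome: bool        # outcome actually observed at this site


def eligible(p: Patient, min_years: float = 2.0) -> bool:
    """Eligibility criterion: enough local data to apply the model at all."""
    return p.years_of_history >= min_years


def net_benefit(patients: list[Patient], threshold: float) -> float:
    """Net benefit of acting on everyone whose risk is >= threshold.

    True positives count as benefit; false positives are penalized by the
    odds of the threshold, which encodes how costly an unnecessary action is.
    """
    n = len(patients)
    if n == 0:
        raise ValueError("no eligible patients")
    tp = sum(1 for p in patients if p.predicted_risk >= threshold and p.had_outcome)
    fp = sum(1 for p in patients if p.predicted_risk >= threshold and not p.had_outcome)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def local_validation(patients: list[Patient], threshold: float,
                     min_years: float = 2.0) -> dict:
    """Compare the model against 'act on everyone' within the eligible cohort."""
    cohort = [p for p in patients if eligible(p, min_years)]
    model_nb = net_benefit(cohort, threshold)
    prevalence = sum(p.had_outcome for p in cohort) / len(cohort)
    treat_all_nb = prevalence - (1 - prevalence) * threshold / (1 - threshold)
    return {"n_eligible": len(cohort), "model": model_nb, "treat_all": treat_all_nb}


# Hypothetical local cohort; the last patient lacks 2 years of data.
cohort = [
    Patient(3.0, 0.90, True),
    Patient(3.0, 0.80, False),
    Patient(2.5, 0.10, False),
    Patient(4.0, 0.20, True),
    Patient(0.5, 0.95, True),
]
report = local_validation(cohort, threshold=0.2)
```

In this framing, a site would pick the threshold from its own workflow (how many follow-up actions it can afford and how costly a needless one is) and rerun the check on a recurring schedule as local data shift.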

Faust: If I'm hearing you correctly, I'm thinking about something like a clinical decision tool that might be very applicable in one setting, but not in another.

For example, to me, the Ottawa Knee Rule is kind of useless because I know that if my patient doesn't get an x-ray today, then they're going to get it in 3 weeks because they can't get an MRI without it -- because all my patients who have any bit of knee pain seem to end up getting MRIs.

Whereas if I was working in any other place with more reasonable utilization on this issue, the Ottawa or Pittsburgh Knee Rule that I'm using decreases the x-ray in the [emergency room] and it never happens again and it actually saves money.

Shah: Exactly.

Faust: So the idea is, does the decision tree that you're sitting under work in the ecosystem that you exist in or not?

Shah: Exactly. The model's output and its effect are inherently coupled with the context in which it is operating.

Hence with this notion of these assurance labs, we're arguing for a systematic process, which on our campus we call a FURM assessment -- fair, useful, reliable models -- but you can only do that assessment in your local context.

The hope is that these assurance labs can give you some global validation. Maybe they can sample data to match your distribution and give you some performance readout that might match your local validation. But in the end, they're also able to give you a prescription saying to monitor these three things when you use it so you know it's working as intended.

Faust: I know that a lot of what you're doing is designed to say to clinicians, "Don't sit this out. Get involved," because it's happening and you'd like your expertise to be taken into consideration and for all of our collective lived experiences to be under the hood -- to extend the car model. But how do we get involved? It sounds like a lot of meetings, and a lot of people don't like meetings.

Shah: Meetings are one way. Voting with your clicks and not using things that don't meet the performance bar or the expectation of the way things should be is another. There are ways of influencing purchasing decisions without going to meetings.

There are ways by which you could leverage your patient advisory councils and prioritize which are the problems that are worth solving first. Because what happens typically is the-squeaky-wheel-gets-the-grease kind of situation. The people who complain the loudest, their problems get prioritized the most. But if you take an external view, the people who suffer the most are the ones who don't get the appointments at your hospitals.

Faust: I've heard you talk about the difference between privacy and security, and I just wonder if you could share a little bit about that with our audience. Tell me, are you worried about the security piece? I think I know what you think about the privacy piece, but let's hear what you think about it.

Shah: So, I'm probably on the controversial side of things here. This thinking is inspired by Ruth Faden's arguments, who is a philosophy professor at Johns Hopkins. The simple, elegant argument she makes, which I fully agree with, is if I want to benefit from other people's data, is it not my duty to share my own?

We all talk about the learning health system, we want to learn from aggregate patient data and so on. But if as an individual, I don't make the choice of sharing my data, we're never going to get there.

So I'm all for data sharing, and I believe that our privacy laws as currently set up are a bit outdated. Under the name of HIPAA [Health Insurance Portability and Accountability Act], we basically block information sharing for medical care and research use, whereas information flows freely for payments and business associate arrangements.

What we want is we want the data to be secure. I don't want my record being leaked out on the internet and being sold on the black market and I get a dark web notification from Google, but I do want my information to benefit the care of hundreds or thousands of other people.

We have to look at it from the lens of, where do we want to be? If we want the learning health system, if we want decision support that is informed by the past experiences of patients like mine, we have to get over this privacy block and insist on secure sharing of data.

Faust: When I write about a patient, which doesn't happen too often in my writing, I'll change the age or maybe the gender or something. I certainly don't do it the same week or month. That sort of suffices for not revealing protected health information.

Can the same thing be done in large data sets? What difference does it make if the model is trained based on a person who's 41 years old with a creatinine of 1.31 and a 40.2-year-old with a creatinine of 1.28? Why not tell the system to sort of quasi-randomize real data?

Shah: That is actually what we do on our campus. We are amongst the few sites, not the only site, but amongst the few sites where de-identified data are available for research very broadly to anyone in the school of medicine. There is a blanket IRB [institutional review board]. You do have to click a data attestation saying, thou shall not misuse the data, you will not resell it, all of that stuff. It's a legally binding agreement.

But any student, any fellow, any resident, postdoc can get complete de-identified data warehouse access, where we've jittered the dates, we've redacted the names, we've done hiding-in-plain-sight where we replace a real name with a fake name. I've tried finding myself and I can't find myself, but in aggregate we can learn the exact kind of thing we would learn from identified data.

So absolutely, that is one way to do it. [It's] probably something we can do immediately, but it doesn't really meet the -- I can't share those data, right? I can't give it to you.

For example, given our institutional policies, even though from a legal standpoint they're de-identified, I can't put them on the internet. That's where the security part comes in. After all these tricks are done, we still need ways of securely sharing so that we can learn from millions of records.
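The de-identification steps Shah mentions (jittered dates, surrogate names in place of real ones) can be sketched in a few lines of Python. This is an assumption-laden illustration, not Stanford's pipeline: the record layout, the `SURROGATE_NAMES` list, the secret key, and the 30-day window are all hypothetical, and a keyed hash is used here only so that every date for one patient shifts by the same amount, which keeps the intervals between visits intact. It does not by itself establish legal de-identification.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical pool of fake names used for "hiding in plain sight".
SURROGATE_NAMES = ["Alex Rivera", "Sam Chen", "Jordan Patel", "Casey Morgan"]


def _keyed_int(secret: bytes, patient_id: str, purpose: str) -> int:
    """Derive a repeatable integer from a secret key and a patient ID."""
    message = f"{purpose}:{patient_id}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def date_shift_days(secret: bytes, patient_id: str, max_days: int = 30) -> int:
    """Per-patient offset in [-max_days, +max_days], constant across visits."""
    return _keyed_int(secret, patient_id, "date") % (2 * max_days + 1) - max_days


def deidentify(record: dict, secret: bytes, max_days: int = 30) -> dict:
    """Return a copy with a surrogate name, shifted dates, and no patient ID."""
    pid = record["patient_id"]
    shift = timedelta(days=date_shift_days(secret, pid, max_days))
    name_index = _keyed_int(secret, pid, "name") % len(SURROGATE_NAMES)
    return {
        "name": SURROGATE_NAMES[name_index],
        "visits": [d + shift for d in record["visits"]],
        "creatinine": record["creatinine"],  # clinical values left untouched
    }


# Hypothetical record and key, for illustration only.
secret = b"site-held-secret-key"
original = {
    "patient_id": "MRN-0001",
    "name": "Real Name",
    "visits": [date(2023, 1, 10), date(2023, 1, 24)],
    "creatinine": [1.31, 1.28],
}
shared = deidentify(original, secret)
```

Because the shift is the same for every visit of a given patient, the 14-day gap between the two visits survives, which is what lets aggregate analyses recover the same patterns as the identified data.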

Faust: Alright. Well, this is an extraordinarily exciting time for this. I'm really grateful that you're on the frontline of making sure that we do it right. Dr. Shah, thanks for joining us today.

Shah: Oh, absolutely. Thanks for having me.