Predictive Health Models: Unsafe and Under-regulated, Researchers Say

The UK is lifting its rose-tinted AI smartglasses to reveal a much greyer reality as governance scrambles to keep up. A research group in Birmingham has called for better guardrails, arguing that a large number of predictive models are potentially unsafe for use.

A lack of proper regulation means predictive models, including machine learning tools, may not work well for certain patients, putting them at risk of missed diagnoses and inappropriate care. 

The analysis, published in BMJ Evidence-Based Medicine, points to a number of reported issues, including poor-quality training data, outdated technology, and failure to adhere to current medical device standards.

Senior author of the paper, Dr Joseph Alderman, NIHR clinical lecturer and AI researcher at the University of Birmingham, recognizes that “while predictive models are, in many ways, an improvement” in healthcare practice, “very few models have been rigorously tested in real-world conditions.”

Models May Uphold or Worsen Biases 

One of the most pressing concerns is bias. Models often perform best on the populations they were trained on — frequently white males, who dominate historical medical datasets. But that doesn’t mean they will perform equally well for everyone else.

By law, medical devices should demonstrate safety and effectiveness across the entire “intended population.” Yet many developers fail to recognize their software even qualifies as a medical device; without that recognition, the rules rarely bite.

Even when models are evaluated across diverse groups, there is little scrutiny of how they make their predictions. Some incorporate race, ethnicity, sex, or gender as “predictors” — things that factor into an algorithm’s calculation. But there’s often no biological rationale:

“Links between these attributes and worse health outcomes are often a result of social disadvantage rather than biology,” says Dr Alderman. “Including these in models risks embedding this disadvantage into healthcare systems and amplifying existing inequalities.”

Taken together, the lack of testing across populations and the potentially careless use of personal attributes could very well swing the AI compass into unsafe territory.
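To make the idea of testing across populations concrete, here is a minimal sketch, using entirely made-up patients, predictions, and group labels rather than anything from the paper, of how a model's discrimination can be checked subgroup by subgroup instead of only on average:

```python
# Minimal sketch: checking whether a risk model discriminates equally well
# across demographic subgroups. All data, column names, and group labels
# here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical evaluation set: true outcome, the model's predicted risk,
# and a recorded demographic group for each patient.
eval_df = pd.DataFrame({
    "outcome": rng.integers(0, 2, n),          # 1 = event occurred
    "predicted_risk": rng.uniform(0, 1, n),    # model output
    "group": rng.choice(["A", "B", "C"], n),   # demographic subgroup
})

# AUC computed separately per subgroup: a model that looks fine on average
# can still perform poorly for an under-represented group.
for group, subset in eval_df.groupby("group"):
    auc = roc_auc_score(subset["outcome"], subset["predicted_risk"])
    print(f"group {group}: n={len(subset)}, AUC={auc:.2f}")
```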

The Impact Is Unclear, Experts Warn

So far, though, scientists just don’t have the information to understand how well these tools are performing. 

“For many models, there is not enough data to confidently conclude that they work, and that they are safe to use,” says Dr Alderman. 

“In healthcare, this matters. We need to have certainty that the things we recommend to patients are safe, and that they work as expected. Otherwise we risk patients making decisions based on predictions which could be misleading or untrue.”

The lack of post-market surveillance makes things even murkier. Because some developers don’t regularly vet models, problems easily fly under the radar — including new bugs, as we’d expect to see from any device after years of use. 

This also means that models often become outdated as clinical standards change. Most predictive models are rarely re-evaluated, the authors write, meaning any degradation in performance over time, or failure to keep up with the clinical 'status quo', may go unrecognised.
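Such re-evaluation need not be elaborate. Below is a minimal sketch, assuming an illustrative baseline AUC and alert threshold rather than figures from the paper, of re-scoring a deployed model on recent data and flagging degradation for review:

```python
# Minimal sketch of post-market surveillance: periodically re-scoring a
# deployed model on recent data and flagging degradation. The baseline AUC
# and alert margin are illustrative assumptions, not figures from the paper.
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.80   # discrimination reported when the model was approved (hypothetical)
ALERT_MARGIN = 0.05   # how much degradation should trigger a review (hypothetical)

def check_for_drift(y_true, y_pred):
    """Compare current discrimination against the approval-time baseline."""
    current_auc = roc_auc_score(y_true, y_pred)
    needs_review = current_auc < BASELINE_AUC - ALERT_MARGIN
    return current_auc, needs_review

# Made-up recent outcomes and predictions, standing in for a fresh audit batch.
y_true = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0.2, 0.6, 0.4, 0.7, 0.5, 0.3, 0.6, 0.8, 0.1, 0.4]

auc, needs_review = check_for_drift(y_true, y_pred)
print(f"current AUC = {auc:.2f}, needs review: {needs_review}")
```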

Frameworks Are Being Neglected 

Recently, frameworks have been introduced to mitigate these issues. TRIPOD+AI and PROBAST+AI were created to reduce the risks of bias and low-quality data and to improve assurance. But many models were built before these frameworks appeared. It’s like placing a dam around an overflowing river: it might stop further “spillage,” but it does nothing to address the water that’s already flooding the land.

Besides, whilst most medical journals make these frameworks mandatory, not all do. Dr Alderman tells us that models published without these checks may not have been evaluated very rigorously, and that both doctors and patients should be more skeptical in such cases.

The Fall-Out

This becomes a huge problem when patient groups are left out of clinical care innovations.

Most health devices, for example, have not been built with elderly people in mind. Tim Coote, after spending years building devices to monitor frail populations, says he was “flabbergasted” at how weak regulation has allowed for “very poor measurement standards” in the industry. 

He points to worn devices, which often produce unreliable data if they are not calibrated or fitted correctly. In many cases, he explains, “there’s no feedback at the time of taking the measurement that you’ve got it wrong,” meaning patients and clinicians may be relying on flawed results without even knowing it. 

Another caveat, Coote adds, is that fall-detection software is often trained on data from “young people jumping up and down” — data that reflects neither the patients most at risk nor the seriousness of the event itself. Without appropriate standards, it’s hard to know how reliable these devices truly are.

Equally troubling, Coote says, is that companies are rarely incentivised to maximise accuracy in the first place. Their focus is usually on reducing false positives — the number of false alarms — rather than false negatives, where real falls go undetected. From a patient’s perspective, the latter is far more dangerous: an elderly person left stranded for hours, or even days, after a fall. But without regulatory pressure to prioritise patient safety over operational costs, there is little reason for developers to shift their priorities.

“It’s because they get a cost of the ‘blue light’ every time it happens,” Coote explains, “which looks bad on them. But the number of missed cases go fairly unnoticed.” 
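To make the asymmetry concrete, here is a minimal sketch, with made-up fall labels and device alerts, showing how a detector can report a modest false-alarm rate while still missing half of the real falls:

```python
# Minimal sketch of the asymmetry Coote describes: a fall detector can show
# a low false-alarm rate while still missing real falls. All labels and
# alerts below are made up.
from sklearn.metrics import confusion_matrix

actual  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = a real fall occurred
alerted = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]   # 1 = the device raised an alarm

tn, fp, fn, tp = confusion_matrix(actual, alerted).ravel()
false_alarm_rate = fp / (fp + tn)   # the "blue light" cost vendors tend to minimise
miss_rate = fn / (fn + tp)          # the real falls that go undetected

print(f"false alarms: {false_alarm_rate:.0%} of normal activity")
print(f"missed falls: {miss_rate:.0%} of real falls")
```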

The Road Ahead: Uncharted, But Not Impossible to Navigate

The Birmingham group suggest a number of solutions. Developers should be more transparent about their data sources, clinicians should better understand the technology they’re using, and frameworks like TRIPOD+AI should be made mandatory — these are just a few ideas put forward by Dr Alderman.

But he also recognizes the trade-off: 

“Whilst greater adherence to software as a medical device regulations will be part of the answer, there is a risk that we build barriers so high that it is impossible to create newer, better models,” Alderman says.

Finally, many predictive models are available to use free of charge — which is good for implementation, but perhaps less good for ongoing innovation. Dr Alderman suggests “this may need to change to enable researchers to properly evaluate their performance and monitor them over the longer term.”