The Next Frontier of Medical AI: Airlocks and AI Monitoring

This article is based on insights from “The Next Frontier of Medical AI: Airlocks and AI Monitoring,” an event hosted by Newton's Tree that brought together leading clinicians, researchers, and policymakers to discuss the challenges of deploying AI in healthcare settings.
The Great AI Adoption Paradox
There are now over 1,000 FDA-approved AI algorithms available to healthcare systems, with CE-marked approvals adding hundreds more. Yet walk into the average NHS hospital and you'll find fewer than five live AI applications actually being used for patient care.
"AI isn't really flying off the shelves in the NHS," explains Haris Shuaib, founder of Newton's Tree and former head of the NHS's first dedicated medical AI team. "You might even call it a bit of a market failure."
The numbers support this assessment. Despite an avalanche of available AI products (more than 100 AI ambient scribes compete in that single market segment), clinical adoption remains limited. Most NHS trusts are stuck in a cycle of trials, pilots, and "kicking the tires," but few are seeing consistent, business-as-usual value from AI deployments.
This isn't due to lack of demand. The problem lies deeper: in the fundamental challenges that emerge after AI systems move from controlled development environments into the messy reality of clinical practice.
The Brittleness Problem
Recent research from UCL has characterized AI as inherently "brittle"—sensitive to environmental changes in ways that can dramatically affect performance. This brittleness manifests in what experts call "hyper-local" performance variations.
"Prof David Lowe and I have had the pleasure of running a national trial involving AI technology," Haris recounts, "where we've seen firsthand deploying it at his institution versus mine led to radically different performance characteristics out of the box."
This environmental sensitivity creates a unique challenge for healthcare leaders. The traditional approach of calling a colleague at another trust to ask "how's that AI system working for you?" provides no guarantee of similar performance at your institution. What works brilliantly in Manchester may fail spectacularly in Birmingham, not due to any flaw in the technology, but because of subtle differences in equipment, workflows, patient populations, or clinical practices.
The problem compounds over time. Even if an AI system performs well initially, changes in the clinical environment can silently degrade its effectiveness. Equipment gets upgraded, clinical workflows evolve, patient demographics shift, and staff turnover brings new practices. Each change can potentially destabilize AI performance in ways that aren't immediately apparent.
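To make this kind of silent degradation concrete, the sketch below is a generic illustration (not a description of any tool discussed at the event) of how a team might compare the distribution of a model's recent output scores against the scores recorded during local validation, using a two-sample Kolmogorov-Smirnov test. The score distributions, window size, and alerting threshold are illustrative assumptions.

```python
# Minimal illustration of detecting silent drift: compare the distribution
# of a model's recent output scores against the scores seen during local
# validation. All distributions and thresholds here are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Scores recorded when the model was validated at this hospital (baseline).
baseline_scores = rng.beta(2, 5, size=5_000)

# Scores from the latest monitoring window; a scanner upgrade or a shift in
# case mix can move this distribution without any change to the software.
recent_scores = rng.beta(2.6, 5, size=1_200)

statistic, p_value = ks_2samp(baseline_scores, recent_scores)

ALERT_P_VALUE = 0.01  # illustrative alerting threshold
if p_value < ALERT_P_VALUE:
    print(f"Possible drift: KS statistic={statistic:.3f}, p={p_value:.1e}")
    print("Flag for clinical safety review before continuing routine use.")
else:
    print("No significant shift detected in the output score distribution.")
```

In practice, an alert like this would prompt a clinical safety review rather than an automatic fix, because the cause may be upgraded equipment, a new referral pathway, or a changed patient population rather than a fault in the model itself.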
The Automation Bias Trap
Perhaps more concerning than environmental drift is the phenomenon of automation bias—the tendency for humans to over-rely on automated systems, even when those systems are providing incorrect guidance.
A study from MIT demonstrates this problem vividly. Researchers showed that an AI system analyzing dermatology images could confidently classify a benign lesion as benign—but when imperceptible noise was added to create a visually identical image, the same AI would classify it as malignant with equal confidence. The critical point: dermatologists couldn't tell the difference between the two images, making them unable to serve as an effective safety check.
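The study's exact method isn't described here, but the general technique of adding a tiny, gradient-guided perturbation to an image can be sketched as follows. The classifier, the input image, and the epsilon value are placeholder assumptions used only for illustration.

```python
# Generic sketch of an adversarial perturbation (FGSM-style), illustrating
# how noise that is imperceptible to a clinician can flip a classifier's
# output. The model, image, and epsilon are placeholders, not details of
# the study described above.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=2 / 255):
    """Return a visually near-identical image that the model may misclassify."""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)
    loss = F.cross_entropy(logits, true_label)
    loss.backward()

    # Nudge each pixel slightly in the direction that increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()

# Usage sketch (assumes `classifier` outputs logits for [benign, malignant]):
# adversarial = fgsm_perturb(classifier, lesion_image, torch.tensor([0]))
# print(classifier(lesion_image).argmax(), classifier(adversarial).argmax())
```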
Real-world evidence reveals even more alarming implications. In one study examining radiologist performance when assisted by AI, researchers found that when the AI provided correct guidance, radiologists of all experience levels performed similarly—achieving about 80% accuracy and fulfilling AI's promise of standardizing care quality.
But when the AI gave incorrect guidance, experienced radiologists with an average of 15 years of practice saw their accuracy plummet to barely 40%. The very expertise that should have protected them from AI errors instead seemed to make them more vulnerable to automation bias.
"so you can begin to see how this over-reliance manifests itself in patient harm” notes Haris, highlighting how AI's promise to enhance clinical decision-making can paradoxically create new vulnerabilities.
Governance Gaps in the Real World
The current healthcare risk management framework, developed for traditional medical devices, proves inadequate for AI systems that can change behavior over time. While clinical governance has been a legal duty for NHS organizations for over 25 years, it relies on periodic audits and reactive incident reporting—approaches that miss the dynamic nature of AI performance.
"The mitigations are almost never feasible to implement," Haris observes. "Like how does an NHS organization monitor data quality for AI? We don't do it for anything else. How are we suddenly going to find budget and resources to do it for AI?"
The governance challenges extend beyond individual hospitals to the broader ecosystem. During a roundtable exercise where experts were split into vendor and hospital groups and asked to respond to a hypothetical AI failure, the result was revealing: "The group turned on each other," reports Robin Carpenter, Head of AI Governance & Policy at Newton's Tree. "Which is very interesting, right? Because there's no clear boundaries of you should be doing this, I should be doing that."
This ambiguity about responsibilities creates dangerous gaps. In some cases, there's no post-market surveillance happening at all. In others, minimal information flows back to regulators or product developers. The result is a system where AI failures can persist undetected, potentially harming patients while providing a false sense of security to clinicians.
The Resource Reality Check
Even when healthcare leaders recognize these challenges, practical constraints limit their ability to address them effectively. IT departments, already stretched thin keeping existing systems operational, view AI deployment as a lower priority than preventing cyber attacks or maintaining core infrastructure.
The pressure on clinical staff adds another layer of complexity. An internal NHS email that went viral on social media perfectly captures this reality: staff were asked to complete 700 discharge letters at an average of three to five minutes per letter—a task that realistically requires 30 minutes per patient to do properly.
"You should do about 60 of the bulk, which should take you three to five hours," the email instructed. This kind of unrealistic pressure makes AI solutions appear attractive as quick fixes, but without proper monitoring and oversight, they risk creating new problems while seeming to solve old ones.
The Evidence Challenge
Traditional approaches to evaluating medical technologies also prove insufficient for AI systems. While many AI products have regulatory approval, the bar for some of these approvals is relatively low, and vendor studies tend to be optimistic about real-world performance. 
Prof. Alicja Rudnicka, who leads population health research at St George's University of London, advocates for a more rigorous approach: "I think you need the best evidence up front before you let it loose. Because otherwise, you could bring in more harm than good."
Her methodology involves creating diverse, large-scale datasets—in one case, over 2 million images from 100,000 people—to ensure AI systems are tested across the full range of patients they'll encounter in practice. Crucially, she insists on independent evaluation with no commercial conflicts of interest.
"Head-to-head evaluations that are done in the intended healthcare setting, a priori," she emphasizes. "Where else would get that? You know, it would cost a fortune to have something like that done, to set up a study like this for yourself."
Solutions on the Horizon: Monitoring
Recognition of these challenges is driving innovation in AI monitoring and governance. Newton's Tree's Federated AI Monitoring Service (FAMOS), developed with support from Innovate UK and in collaboration with the MHRA AI Airlock, represents one approach to addressing the post-deployment challenge.
The system provides near real-time monitoring of AI performance across multiple hospital sites, allowing for early detection of drift and performance degradation. By creating a "sandbox" environment where AI systems can be evaluated on local data before full deployment, hospitals can better predict how systems will perform in their specific environment.
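As a rough illustration of the monitoring idea (not FAMOS's actual implementation, which the article doesn't detail), the sketch below compares each site's recent positive-flag rate against its own locally validated baseline and surfaces sites that have drifted beyond a tolerance band. The site names, thresholds, and minimum case counts are illustrative assumptions.

```python
# Generic sketch of cross-site AI monitoring: each hospital reports a rolling
# summary of the model's positive-flag rate, and sites that move beyond a
# tolerance band around their own local baseline are surfaced for review.
# Names, thresholds, and case counts are illustrative only.
from dataclasses import dataclass

@dataclass
class SiteSummary:
    site: str
    baseline_positive_rate: float   # rate observed during local validation
    recent_positive_rate: float     # rate over the latest monitoring window
    window_cases: int               # number of cases in that window

def sites_needing_review(summaries, tolerance=0.05, min_cases=200):
    """Return sites whose recent behaviour sits outside the tolerance band."""
    flagged = []
    for s in summaries:
        if s.window_cases < min_cases:
            continue  # too few cases this window to judge; keep collecting
        drift = abs(s.recent_positive_rate - s.baseline_positive_rate)
        if drift > tolerance:
            flagged.append((s.site, round(drift, 3)))
    return flagged

summaries = [
    SiteSummary("Site A", 0.12, 0.13, 900),
    SiteSummary("Site B", 0.12, 0.22, 450),   # marked shift: needs review
    SiteSummary("Site C", 0.10, 0.18, 120),   # too few cases this window
]
print(sites_needing_review(summaries))  # [('Site B', 0.1)]
```

A per-site baseline matters here because, as noted above, the same product can behave very differently from one hospital to the next; comparing every site against a single global reference would hide exactly the hyper-local variation that monitoring is meant to catch.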
"We're not fundamentally an AI company," Haris explains. "And we're not a technology company. We're a healthcare company. And what we actually care about is doing AI properly to deliver better health outcomes."
This approach recognizes that effective AI deployment requires more than just purchasing algorithms—it requires building the infrastructure, processes, and expertise to monitor and maintain AI systems over time.
A Global Leadership Opportunity
The UK is positioned to lead global efforts in responsible AI deployment. The Stanford AI Index report identifies knowledge gaps and regulatory uncertainty as the primary barriers to AI implementation worldwide—areas where the UK's unified health system and innovative regulatory approaches provide significant advantages.
"We generally could lead globally within this space," notes Robin. "Because we've got airlock, we've got the research that's happening here, we have a fantastic NHS business, we have the NHS which is slightly more unified than other ecosystems around the world."
The MHRA AI Airlock, which allows for controlled testing of AI systems in real-world settings while maintaining regulatory oversight, represents a potentially game-changing approach to bridging the gap between AI development and deployment.
Beyond the Technology: Building AI Literacy
Effective AI deployment requires more than technological solutions—it demands a workforce prepared to work alongside AI systems. This means developing AI literacy not just among technical staff, but across the entire healthcare organization.
"We need to build that capability around AI," explains Prof. Lowe, "and kind of figure that down a little bit further, building up the digital literacy across your entire organization, across the entire life cycle."
This literacy includes understanding AI's limitations, recognizing signs of performance degradation, and maintaining appropriate skepticism about AI outputs. It's not about replacing clinical judgment with algorithmic decision-making, but about creating a partnership where human expertise and artificial intelligence complement each other effectively.
Conclusion: The Promise Remains
The promise of AI in healthcare remains compelling: more consistent diagnoses, reduced clinical workload, and improved patient outcomes. But realizing this promise requires acknowledging that deployment is not the end of the AI journey—it's the beginning of a new phase requiring different skills, processes, and vigilance.
As one expert noted during the panel discussion, "We're just going to have to move forward together." The path to successful healthcare AI isn't just about better algorithms or more data—it's about building the human and organizational capabilities to ensure that AI systems remain safe, effective, and beneficial long after they leave the development lab.
The question isn't whether healthcare AI will succeed, but whether we'll build the infrastructure and expertise to help it succeed safely. The stakes—improved patient care and clinical efficiency—are too high to accept anything less than systems that work reliably, transparently, and safely over time.
Expert Insight on the Future of Healthcare