Why Agentic AI Isn't the Future of Healthcare

June 17, 2025

Sit in on a neuroscience lecture at the University of Cambridge, and you might be surprised by what you learn—or rather, don’t learn. “By the end of this module, you won’t understand how this works. I don’t understand how this works. No one does,” were the resounding words from a world-class professor, about to teach his students the basis of motor function.

The brain is the least understood organ in the body, and while some of the important bits have been pieced together, the larger jigsaw is full of gaps that scientists don’t quite know how to fill. Consciousness, for example, is an abstract concept that has long puzzled researchers, with the field “extremely polarized” by contrasting theories that struggle to define even the word itself.

We might not fully understand how the brain works, but the line between human and artificial intelligence is becoming increasingly blurred. The question once posed by Turing, “Can machines think?”, is no longer philosophical, with AI systems performing complex tasks that closely parallel human thought processes.

Agents Are Pioneering the Future of AI

Arguably, the sharpest blurring of that line has come with the birth of agents. Unlike the Q&A-style interactions of early Large Language Models, AI agents are designed to operate autonomously, carrying out tasks with little or no human input. So, while a chatbot may tell you the best way to structure an email, an AI agent could write and send the message, schedule the meeting, and reorganize your calendar around it.

Dr Bang Liu, Associate Professor in the Department of Computer Science and Operations Research, University of Montreal, explains this difference:

“Large language models like ChatGPT can communicate with us. If you send a text, it will generate text as a response. Agents have an entire system built around this—they’re a brain with a body.”

But the concept itself isn’t new. As far back as the 1990s, researchers like Jacques Ferber were exploring the foundations of agent-based systems. In 1999, Ferber defined agents as “entities with autonomy, perception, and communication capabilities,” often working in “social networks” to solve problems collaboratively. This network, similar to how an office functions with jobs dedicated to specific employees, formed the basis of a multi-agent system.

Primitive AI agents, however, lacked adaptability. Critics described them as “rule-following” and “constrained” by their environment—they stuck to the script, learning structured associations rather than digging beneath the surface to understand the underlying context.

Fast-forward to today, and the idea has matured. What were once reactive agents, working with no memory or foresight, became deliberative—the “thinkers” of the AI realm, injecting complex reasoning behind their decisions.

In a real-world scenario, reactive agents can be thought of as non-player characters in a game: they react immediately to a player’s speech or actions, within confined structures pre-determined by the game’s programming. Deliberative agents, on the other hand, might take the form of a GPS app, calculating the fastest route based on current traffic conditions.
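To make the contrast concrete, here is a minimal Python sketch with purely hypothetical class names: a reactive agent maps stimuli straight to scripted responses, while a deliberative agent searches over alternatives (here, routes on a small road graph) before acting, loosely mirroring the NPC and GPS examples above.

```python
# Illustrative contrast between a reactive and a deliberative agent.
# All class and function names are hypothetical, for explanation only.

import heapq

class ReactiveAgent:
    """Maps each stimulus directly to a scripted response (like a game NPC)."""
    RULES = {
        "player_greets": "Hello, traveller!",
        "player_attacks": "Guards! Guards!",
    }

    def act(self, stimulus: str) -> str:
        # No memory, no planning: unknown inputs fall back to a default line.
        return self.RULES.get(stimulus, "...")


class DeliberativeAgent:
    """Plans ahead by searching a weighted road graph (like a GPS app)."""

    def __init__(self, road_graph: dict[str, dict[str, float]]):
        self.road_graph = road_graph  # edge weights = current travel times

    def act(self, start: str, goal: str) -> list[str]:
        # Dijkstra's algorithm: deliberate over alternatives before acting.
        queue = [(0.0, start, [start])]
        visited = set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == goal:
                return path
            if node in visited:
                continue
            visited.add(node)
            for neighbour, minutes in self.road_graph.get(node, {}).items():
                heapq.heappush(queue, (cost + minutes, neighbour, path + [neighbour]))
        return []


if __name__ == "__main__":
    print(ReactiveAgent().act("player_greets"))
    roads = {"home": {"ring_road": 10, "city_centre": 25},
             "ring_road": {"hospital": 12},
             "city_centre": {"hospital": 5}}
    print(DeliberativeAgent(roads).act("home", "hospital"))
```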

At the peak of this change came frameworks like AutoGPT, a “project manager” agent that tackles goals by breaking them into smaller subtasks and executing them step by step. Alongside it, BabyAGI introduced a feedback-loop approach, allowing agents to track progress, revise priorities, and adapt their plans dynamically.
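The “project manager” pattern can be sketched in a few lines of Python. This is not AutoGPT’s or BabyAGI’s actual code; the plan, execute, and reprioritise functions below are hypothetical stand-ins for calls to a language model, and the loop simply decomposes a goal into subtasks, executes them one at a time, and feeds results back into a task queue so priorities can be revised.

```python
from collections import deque

# Hypothetical stand-ins for LLM calls; real frameworks prompt a model here.
def plan(goal: str) -> list[str]:
    return [f"research: {goal}", f"draft: {goal}", f"review: {goal}"]

def execute(task: str) -> str:
    return f"result of ({task})"

def reprioritise(queue: deque, result: str) -> deque:
    # Feedback step (in the spirit of BabyAGI): completed work can spawn
    # follow-up tasks or reorder what is left.
    if "review:" in result:
        queue.appendleft("apply review feedback")
    return queue

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    queue = deque(plan(goal))                # 1. break the goal into subtasks
    completed = []
    for _ in range(max_steps):               # guard against endless loops
        if not queue:
            break
        task = queue.popleft()               # 2. take the next subtask
        result = execute(task)               # 3. execute it
        completed.append(result)
        queue = reprioritise(queue, result)  # 4. revise the plan with feedback
    return completed

if __name__ == "__main__":
    for step in run_agent("summarise this week's tumour board notes"):
        print(step)
```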

AI Agents vs Agentic AI

This transition, from reactive to deliberative, has drawn a subtle, yet important, line between two concepts: AI agents and agentic AI. In a recent paper submitted to arXiv, AI agents are defined as single-entity systems that use tools, reason through steps, and draw on up-to-date information to carry out specific tasks. On the other hand, the authors describe agentic AI as “multiple, specialized agents that coordinate, communicate, and dynamically allocate subtasks within a broader workflow.”

In essence, the two diverge both in terms of capability and scale, with agents referring to the singular systems that make—albeit sometimes complicated—decisions alone, while agentic AI denotes a network of artificial agents, working together to truly embody the word “autonomy.”

“When we think of AI agents, we mainly talk about the agent itself. It is an intelligent being that can perceive inputs from the environment,” says Bang. “But an agentic system has many different components. Imagine breaking down a complex task into different subparts: each agent in the system has specific functionalities. These agents can work together to form effective protocols and organizational structures.”
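The distinction is easier to see in code. The minimal sketch below is a hypothetical illustration, not any vendor’s or the paper’s actual design: a single Agent handles a task on its own, while an Orchestrator, standing in for an agentic system, breaks a workflow into subtasks and routes each one to a specialised agent.

```python
# Minimal, hypothetical sketch of a single agent vs. an agentic (multi-agent) system.

class Agent:
    """A single agent: perceives an input and acts on it with its own tools."""
    def __init__(self, name: str, skills: set[str]):
        self.name = name
        self.skills = skills

    def handle(self, task: str) -> str:
        return f"[{self.name}] completed '{task}'"


class Orchestrator:
    """An agentic system: decomposes work and routes subtasks to specialists."""
    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def route(self, subtask: str) -> Agent:
        # Naive keyword matching; real systems negotiate, plan and share memory.
        for agent in self.agents:
            if any(skill in subtask for skill in agent.skills):
                return agent
        return self.agents[0]  # fall back to a generalist

    def run(self, subtasks: list[str]) -> list[str]:
        return [self.route(t).handle(t) for t in subtasks]


if __name__ == "__main__":
    team = Orchestrator([
        Agent("generalist", {"summarise"}),
        Agent("imaging", {"scan", "radiology"}),
        Agent("labs", {"blood", "biopsy"}),
    ])
    workflow = ["summarise referral letter", "review chest scan", "flag abnormal blood results"]
    for line in team.run(workflow):
        print(line)
```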

Agentic AI Through the Healthcare Lens

Although a relatively new idea, agentic AI is already gaining momentum in the healthcare space, with tech giants Microsoft and Apple taking considerable strides.

Just this month, Microsoft announced a new agentic AI orchestrator in its Azure AI Foundry Agent Catalog. Leaning into the “orchestrator” concept, the company wants to help agents work together to solve complicated tasks, starting with cancer care. In its initial stages, Microsoft will test a platform for supporting Tumor Boards: groups of cancer specialists from different fields who come together to review cases and decide on the best treatment plans.

Joshua Warner, MD, a radiologist at University of Wisconsin Health, told the press that the initiative will help “surface, summarize, and take action” on multimodal medical data, meaning data across different types, such as doctors’ notes, blood test results, or imaging scans.

The aim? To decrease the time of Tumor Board review from hours to just minutes.

Similarly, Apple is making strides in the agentic AI space with “Project Mulberry”—a health-focused initiative to develop an AI health coach that acts like a virtual doctor. With millions of Apple Watch users worldwide tracking metrics like heart rate, sleep, and blood oxygen levels, Apple has access to a vast reservoir of health data to power this tool. The project is expected to launch with iOS 19.4 in spring or summer 2026, with Apple hoping to shift from “reactive data collection” towards “proactive health management.”

And, across the globe, bigger projects are unfolding. In 2024, researchers from Tsinghua University, China, launched the world’s first AI agent hospital: a digital institution run by 14 AI doctors and four AI nurses. These agents could diagnose and treat 10,000 patients (themselves AI models) in the span of a few days, something that would ordinarily take human doctors two weeks. Later, in April 2025, it was announced that the hospital would expand to 42 agents, all simulating medical scenarios with their AI patient counterparts.

True to its role as a simulation, the hospital gives both medical students in training and practicing physicians a better grasp of which treatment pathways yield which outcomes, and which situations call for which tests. Think of it as a training ground: if young physicians can actually see what happens when they make different clinical decisions, rather than relying on textbook illustrations, they’ll feel much more prepared when it comes to the “real deal.”

While not an actual hospital, the idea is certainly startling—the once-futuristic movie scenes of “robot doctors” treating patients now transcend the screen, almost within reality’s grasp.

Black Boxes: When the Danger Is Invisible

But with these exciting developments come some careful considerations. In a healthcare context, the idea of autonomous AI—handling complete processes like diagnostic tests or patient treatment—is considerably more daunting than in other sectors. For example, in a courtroom, an AI agent’s error might have legal implications, but the risk pales in comparison to human health—say, if an AI model incorrectly dismissed a patient’s case as non-cancerous.

One of the major concerns is transparency—something that complex multi-agent systems lack. In a typical clinical AI setting, if a doctor can see how an AI model arrived at its answer, the mistake can be corrected, for example, by picking out the signs of cancer that the AI neglected. But the story shifts when we consider the nuances of trust between humans and machines.

Both human doctors and AI systems are likely to make errors at some point, even with a 99% accuracy rating. However, disagreements between the two, for example over whether a biopsy is cancerous, carve out space for an entirely new kind of problem: uncertainty over whose judgement to trust.

Say an AI model incorrectly labels a tissue sample as “benign.” A doctor might—correctly—think the opposite is true, i.e., the sample is cancerous. But, swayed by a model’s “pearly” 99% accuracy rating, the doctor might also think, “the AI must be right,” and override their clinical judgement in favour of the model’s. Here, the patient’s cancer would go undiagnosed and, crucially, untreated.

This problem was highlighted by a 2024 commentary in Nature Communications, referring to the phenomenon as a “false conflict error,” where clinicians defer to incorrect AI decisions even when their own judgment is right.

Although the “low-performing” clinicians made better decisions with AI, the authors noted that the best clinicians lost accuracy because of this phenomenon. By themselves, the high-performing doctors made 26 incorrect judgment calls; when supported by the AI, this number rose to 38, an increase of almost 50%.

The paper points to a major culprit: black-box logic. When working with complex models, e.g., convolutional neural networks, researchers often only see the input and the output. What happens between these two points is a mystery—a “black box” that prompts the question: Just how does it know that?

But the issue multiplies when it comes to agentic AI. If a network of black-box agents makes an error, it is very difficult to pinpoint where the system has failed; there are simply too many avenues to explore.

Not only does this lack of transparency waste time as clinicians try to work out whether an AI is right, it also sows doubt in the minds of highly skilled medics. As a result, doctors might be left to over-scrutinize the system’s decisions or even question their own medical “utility,” deepening the valley of mistrust between doctor and machine.

It’s because of this that we haven’t seen much progress in the application of agentic systems to healthcare—certainly compared to administrative tasks with lower stakes.

Where Are We Headed?

Beyond transparency, there are other major obstacles ahead of agentic AI. So, what would it take to build AI agents that are both powerful and trustworthy?

Recently, Dr Bang Liu and his team posted a paper on arXiv, “Advances and Challenges in Foundation Agents,” exploring the future directions these systems might take.

Specifically, they envision two milestones for the future of AI agents: (1) an age of general-purpose agents, or “artificial general intelligence,” capable of handling an array of human-level tasks rather than being “confined to specific domains,” and (2) self-evolving agents that learn directly from their environment and continuously improve through interactions with both humans and the data they are exposed to.

Achieving these, however, requires integrating several features:

  1. Advanced reasoning. Central to a general-purpose agent is the ability to make decisions beyond pattern recognition—something that, at the moment, artificially intelligent systems cannot do. Large Language Models have big knowledge gaps; they understand things superficially, but lack a deep comprehension of the way the world works. This means they struggle to understand causal relationships: a critical skill in clinical settings. For example, a patient might have symptoms that point to a particular disease, except for one that doesn’t slot in with the rest. That one symptom might indicate something else entirely—perhaps a rare disease that doctors don’t often encounter in their day-to-day practice. Following the laws of probability, an AI model might ignore this one symptom in favour of the most likely outcome. But a human doctor has the creativity and deep reasoning to “go against the grain” and uncover the right diagnosis.

    To move toward this milestone, AI must go beyond pre-trained outputs and into real-time, context-aware problem solving, combining symbolic logic, world knowledge, and probabilistic inference. But, of course, building bigger, more complex systems requires an immense amount of computational power.

    2. Perception. To make a decision, like when to eat, humans don’t just rely on one sense, such as how good a meal looks. We integrate multiple inputs from our environment, like visual, auditory and olfactory information. Crucially, though, we also rely on internal cues, like hunger levels, emotional state and past experiences in similar contexts.

    Consider this: someone who once had food poisoning from a meal is much less likely to eat the same meal again, even if they are very hungry or the meal now in front of them objectively looks appetizing. Someone who fondly remembers eating the meal as a child, however, is much more likely to eat it again.

    This makes human decision-making incredibly complex, and nearly impossible to simulate in the artificial world. Multimodal data processors do, nonetheless, exist; for instance, Unified-IO 2 is a model that can understand and generate images, text, audio, and actions by encoding each input as tokens in a shared representation space.

    Still, these systems face obstacles. For example, Bang and his co-authors point out that they often fail to capture the richness of multimodal data, with vital information sometimes slipping through the cracks.

    Additionally, systems that try to integrate lots of sensory information are prone to hallucinations—something that becomes especially visible with large language models. As Bang et al. note, large language models cannot yet hold long-term memories without “hallucinating,” something that—they theorize—could be addressed with better data alignment frameworks, dynamic networks that adapt easily to the environment they’re in, and strong memory networks that allow agents to remember more.

    3. Self-evolution. This is fundamental to human intelligence—it’s how we learn and evolve. Like learning to ride a bike: we fall, scrape our knees, but each tumble teaches us balance, until one day we’re riding smoothly without even thinking. Because it touches every one of an agent’s modules, from perception to reasoning to action, self-evolution will arguably be a core requirement for agents as the field matures.

    Two emerging strategies for self-evolution in AI agents are online and offline self-improvement: the former driven by real-time feedback, the latter built on a stable, pre-trained foundation (a minimal code sketch contrasting the two follows this list).

    Offline self-improvement involves training agents on curated datasets or simulated environments before deployment. This approach is safer and easier to evaluate, making it ideal for controlled learning. For instance, an AI healthcare assistant might be trained on thousands of clinical cases before it's ever used in a real hospital setting. But once deployed, its performance remains relatively static unless it’s retrained manually or periodically updated.

    In contrast, online self-improvement allows agents to learn continuously during use, adapting to their environment, user behavior, or new situations on the fly. This mirrors how humans refine their skills through daily experience. Imagine a customer service agent that gradually gets better at handling nuanced complaints the more it interacts with users. While powerful, this strategy introduces significant challenges: ensuring safety, preventing drift from intended behavior, and maintaining reliability over time.

    Ultimately, enabling robust self-evolution may require a hybrid approach—one that combines the structured knowledge and safety of offline training with the flexibility and adaptability of online learning. Achieving this balance is key to building agents that not only perform tasks intelligently, but also grow smarter over time.

    4. Action modules. Bang and colleagues describe action as “the decision-making process through which an agent interacts with its environment.” In other words, action is where an agent’s reasoning meets reality. It's one thing for an agent to understand a user’s request or plan a response—it’s another to do something meaningful with that understanding.

    Whether it’s writing and sending an email, booking a flight, or controlling a robotic arm, the action module is what turns intention into impact.

    Right now, AI agents mostly learn how to act by reading instructions. But in real life, people learn to act by watching, trying, or being shown, not just reading. To make AI agents more capable and practical, we need to teach them how to act using all kinds of inputs, like images, videos, or demonstrations, not just text.

    Other issues for “action” modules, highlighted by Bang and colleagues, include efficiency, evaluation (equipping agents with a framework that lets them measure and adjust their performance), and privacy (keeping sensitive user information secure).
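As promised above, here is a deliberately tiny sketch of the offline-versus-online distinction from point 3. It is a toy illustration under its own assumptions, not anything from the cited paper: the “agent” is just a threshold classifier, fitted once on a curated batch (offline) and then nudged after each live interaction (online).

```python
import random

# Hedged toy illustration: the "agent" is a threshold classifier, but the two
# update regimes mirror offline vs. online self-improvement.

class ThresholdAgent:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def predict(self, score: float) -> bool:
        return score >= self.threshold

    def offline_train(self, dataset: list[tuple[float, bool]]):
        # Offline: fit once on a curated dataset before deployment, then freeze.
        positives = [s for s, label in dataset if label]
        negatives = [s for s, label in dataset if not label]
        if positives and negatives:
            self.threshold = (min(positives) + max(negatives)) / 2

    def online_update(self, score: float, correct_label: bool, lr: float = 0.05):
        # Online: nudge the threshold after every real-world interaction.
        error = int(correct_label) - int(self.predict(score))
        self.threshold -= lr * error  # shift towards accepting missed positives, etc.


if __name__ == "__main__":
    random.seed(0)
    curated = []
    for _ in range(20):
        s = random.random()
        curated.append((s, s > 0.6))       # synthetic "curated" training batch

    agent = ThresholdAgent()
    agent.offline_train(curated)            # safe and evaluable, but static once deployed
    for score, label in [(0.42, True), (0.61, False)]:
        agent.online_update(score, label)   # adapts continuously, but can drift
    print(round(agent.threshold, 3))
```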

Safety is a Top Priority

But perhaps the most pressing issue in healthcare, a field where life and death are thinly separated, is safety.

With scientists moving away from single-entity “agents” and towards multi-agent networks, concerns about reliability are growing.

A recent study into multi-agent systems found that when a network of agents was tasked with agreeing on “names”—essentially, strings of characters or symbols—something unexpected happened. As the agents interacted, not only did social “norms” begin to form, but so too did collective biases, seemingly out of thin air. These biases, crucially, were not present when the agents were tested individually, suggesting that the act of working together caused them to emerge—much like how a choir, made up of strong solo singers, can fall out of tune if each voice starts adjusting to the others rather than the original pitch.
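The set-up resembles a classic naming game, and even a toy simulation shows how a population-level preference can crystallise when no individual agent starts with one. The sketch below is a simplified illustration under its own assumptions, not the study’s actual protocol: paired agents that fail to coordinate copy each other’s names until one arbitrary convention dominates.

```python
import random
from collections import Counter

# Toy naming game (simplified; not the cited study's exact protocol).
# Each agent keeps a short memory of names; a paired speaker and hearer
# "succeed" if they share a name, otherwise the hearer adopts the speaker's.

def simulate(num_agents: int = 30, vocabulary=("A", "B"), rounds: int = 2000, seed: int = 1):
    rng = random.Random(seed)
    memories = [[rng.choice(vocabulary)] for _ in range(num_agents)]  # unbiased start

    for _ in range(rounds):
        speaker, hearer = rng.sample(range(num_agents), 2)
        said = rng.choice(memories[speaker])
        if said in memories[hearer]:
            memories[speaker] = [said]      # success: both commit to that name
            memories[hearer] = [said]
        else:
            memories[hearer].append(said)   # failure: hearer remembers it anyway

    return Counter(name for memory in memories for name in memory)

if __name__ == "__main__":
    # Individually the agents have no preference, yet the population tends to
    # lock onto a single name: an emergent, collective "bias".
    print(simulate())
```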

Although the agents were only choosing names in this experiment, the implications are far-reaching: Could biases that don’t appear in individual AI systems emerge spontaneously in multi-agent networks, simply through interaction? In healthcare, where such systems may one day guide clinical decisions, the stakes are high. For example, in the testing stage, it might seem that an agent accurately triages patients from all backgrounds—but collective biases might emerge over time so that certain people are discriminated against.

Such a study points to a more deep-rooted problem: misalignment. While individual agents might appear to act in line with human goals, interactions between the agents can give rise to behaviors no one explicitly programmed—or even predicted.

And this issue goes hand-in-hand with transparency. As discussed earlier, an AI model making the wrong call can often be corrected—in this case, by a human doctor. But when a whole AI network “takes the reins,” without us being able to understand why, it’s difficult to catch these mistakes.

Bang explains: “The reliability of agents is very important. Even for single agents, we cannot trust all of their outputs. If you have more roles and more agents, especially if they are from different agent providers, we’re making the system more complex. And this complexity naturally leads to less reliability.”

Could Neurodiversity Solve AI Misalignment?

While transparency is something that could be addressed with relative ease, misalignment is a much more systemic issue—a stubborn “weed” that is deeply rooted in the very foundations of AI.

In a recent paper, Dr Hector Zenil, Associate Professor and Senior Lecturer at King’s College London, explores a bold idea that could address the problem: neurodiversity. Perfect alignment, the authors argue, is unattainable; any sufficiently advanced AI capable of artificial general intelligence is bound to diverge from what it was programmed to do. In fact, Dr Zenil and colleagues didn’t just argue that alignment is hard to achieve, they proved it’s mathematically impossible.

But, instead of seeing this as an obstacle, they used it to their advantage. The team created a network of AI agents, some aligned with human interests and others, which they refer to as “neurodiverse,” that were not. They then let the agents debate, watching to see whether the aligned agents could persuade the others. They could.

This dynamic, they wrote, encouraged diverse viewpoints and even led to changes in agents' behavior over time, much like how diverse human teams are often better equipped to reach fairer, more balanced decisions.
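One way to picture the dynamic, purely as an illustration and not as the method Zenil and colleagues actually used, is simple opinion averaging: a few agents anchored to a human-aligned position debate with freely drifting “neurodiverse” ones, and the group’s view gradually shifts toward alignment without anyone being forced to agree.

```python
import random

# Toy opinion-dynamics sketch (illustrative only; not the paper's method).
# Opinions live on a 0-1 scale, where 1.0 represents the human-aligned position.

def debate(num_aligned: int = 3, num_diverse: int = 7, rounds: int = 50, seed: int = 7):
    rng = random.Random(seed)
    aligned = [1.0] * num_aligned                          # anchored to human goals
    diverse = [rng.random() for _ in range(num_diverse)]   # start anywhere

    for _ in range(rounds):
        group_view = sum(aligned + diverse) / (num_aligned + num_diverse)
        # Aligned agents argue from their anchor; diverse agents update freely,
        # blending their own view with the group's after each round of debate.
        diverse = [0.8 * opinion + 0.2 * group_view for opinion in diverse]

    return sum(diverse) / num_diverse

if __name__ == "__main__":
    print(f"average 'neurodiverse' opinion after debate: {debate():.2f}")
```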

That said, the approach runs into roadblocks with closed, proprietary systems. These often come with strict constraints that limit the agents’ ability to explore or adapt freely—highlighting the downside of trying to tightly control how AI behaves.

Regardless, it shows how modeling networks on real human communities, by introducing agents that think differently to the rest of the group, could potentially prevent dangerous ideas, like collective bias or polarized political beliefs, from spawning and circulating in agentic systems.

Should We Be Chasing Humanistic Intelligence?

As these artificial systems edge closer and closer to human-like intelligence, a question arises: what are we actually aiming for in building them?

But, although some aspects of agentic AI sound remarkably human, Bang emphasizes that the bigger goal is not to replicate every feature of the human brain, but rather to understand which features are actually advantageous to these artificial systems.

“If we can bridge the gap in a beneficial direction, we hope that AI can be safe, understandable to ourselves, and can be controlled,” he says.

Such a sentiment was echoed by Dr Zenil, also associated with the King’s Institute for AI and founder of Oxford Immune Algorithmics:

“We tend to believe that human intelligence is the pinnacle of intelligence, but it has many faults and we should strive higher.”

In another recent paper, he proposed a measure of artificial superintelligence, showing how poorly current leading LLMs perform against it.

He stresses that while we want AI to have certain human traits, it is “undesirable” to inherit others, like biases and human mistakes. He suggests that, instead, we should “go beyond human intelligence,” leaning into the novel concept of artificial superintelligence: intelligence that surpasses the brightest humans on earth.

Conclusions

But of course, for the moment, agentic AI is still in its infancy.

“We’re not there yet,” said Dr Pooja Sikka, General Partner at Meridian Health Ventures, at Gen101’s Agentic AI in Healthcare event earlier this month. “We still have a long way to go before we can unlock the true potential of agentic AI.”

Still, this early stage may be a blessing in disguise. As agentic AI begins to find its footing, we have a rare window to shape its future deliberately.

The path ahead is not just about building smarter machines, but about deciding what kind of intelligence we want to foster: one that mirrors our own flaws, or one that helps us move beyond them. What we choose now will shape not only the future of technology, but the future of how we understand—and extend—our own intelligence.