The Race to the First Virtual Cell

The Language of Genomes

Scientists have long understood the basic parts of the human cell. The nucleus was discovered in 1833; the maze-like Golgi apparatus in 1898. But knowing a cell’s parts and functions only takes us so far. For more than a century, scientists have been engaged in a much more difficult journey of discovery—figuring out how cells interact with one another across tissues and organs, how they behave, and why they behave a certain way in response to diseases and drugs. Each cell is its own complex world, interacting with other cellular worlds.

For instance, we know that metabolism —a dense network of biochemical reactions— is important for understanding disease. But metabolic processes differ depending on cell type and tissue environment. And we now know that the same proteins can be found in different parts of the cell, performing entirely distinct functions.

Modeling that complexity requires a virtual cell: a digital model that can simulate cell functions and interactions in many potential states.

In December 2024, the journal Cell published a paper from 42 prominent researchers across Stanford, Columbia, Genentech, the Chan Zuckerberg Initiative, and elsewhere, laying out the first blueprint for this virtual cell. Cell behaviors are non-linear, the authors write, and happen on multiple scales simultaneously. Even “small changes in inputs can lead to complex changes in outputs.”

But now, thanks to AI and an enhanced ability to capture biological data, we can train machine learning models to recognize patterns and make predictions “without needing explicit rules or human annotation.”

In those blank spaces where scientists have an incomplete understanding, the virtual cell could provide new insights—identifying the underlying patterns in the data that no human mind could, fueling new hypotheses, and supercharging the discovery of new treatments.

This is the quest on which a number of our greatest living scientists have now set themselves. And thanks to recent advances in AI and omics (the studies of different cell functions such as genomics, metabolomics, proteomics, and transcriptomics), the virtual cell is now coming into view.

According to Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia University and one of the authors on the Cell paper, the first virtual cell could arrive in 15 years or less. “This field is moving at the speed of light,” he said.

Unlocking the Human Genome

Getting here required a series of major scientific breakthroughs over the past several decades, each mapping a portion of that biological complexity, often despite significant opposition and skepticism.

In the 1990s, arguably the most important biomedical research undertaking was the Human Genome Project—a driver for understanding the genetic underpinnings of disease and more accurate diagnosis. This project, too, began as a seemingly impossible quest—a 13-year journey of discovery by an international team of researchers to generate the first sequence of the human genome.

The genome is our human blueprint—a complete set of DNA instructions for every cell in the body. It consists of 23 pairs of chromosomes located in the cell’s nucleus and a small chromosome in the cell’s mitochondria —altogether, these structures contain around three billion nucleotides, holding all of the information needed for a person to grow and function.

*Visualization of DNA. Image created with Flux1.1 Pro Ultra.*

To unlock it meant we could understand the genomic drivers of cancer—for example, what “normal” genetic sequences look like and how mutations in certain genes cause cancer cells to replicate uncontrollably. Beyond this, genetic insights from the Human Genome Project have allowed patients with rare disorders —over 7,000 diseases affecting 350 million people around the world— to understand the genetic causes of their diseases for the first time and to find others like them. At the same time, it springboarded research, enabling scientists to discover new treatments that were previously out of reach.

Stephen Quake, Lee Otterson Professor of Bioengineering and Professor of Applied Physics at Stanford University, co-President of the Chan Zuckerberg Biohub, and another author on the Cell paper, said “the genius” of the Human Genome Project “was to turn it from a biology problem to a chemistry problem. There is a test tube with a chemical, and it works out the structure of that chemical. And if you can do that, the problem is solved.”

But the Human Genome Project spawned backlash from its outset. Since only 2% of the genome codes for proteins, opposing scientists called the rest “junk DNA” and argued that the project wasn’t worth its $3 billion price tag. 55 people from 33 academic institutions wrote letters in opposition.

The incident shines a light on what happens whenever there’s a significant challenge to the way scientific inquiry is conducted. First, it’s deemed impossible and foolhardy. Later, it’s hailed as genius.

We can draw a line from the Human Genome Project, completed in 2003, to CRISPR-Cas9, the 2012 breakthrough that gave researchers a way to probe how genes in the human cell are expressed and what their functions are. Developed by Nobel Prize winners Jennifer Doudna and Emmanuelle Charpentier, CRISPR-Cas9 uses the Cas9 protein like molecular scissors to cut DNA at precise locations.

*CRISPR-Cas9 enables researchers to isolate genes of interest. Diagram created with Sora.*
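
To make the molecular-scissors picture concrete, here is a minimal Python sketch of how a cut site might be located in a DNA string. It relies on background that is not stated in this article: the widely used Cas9 from Streptococcus pyogenes is steered by a guide sequence of about 20 letters, needs an adjacent "NGG" motif (known as the PAM) next to the target, and typically cuts about three letters upstream of that motif. The function names and the example sequence are hypothetical, and the sketch ignores the reverse strand, mismatches, and off-target effects.

```python
import re


def find_cas9_sites(dna: str, guide: str) -> list[int]:
    """Return cut positions for exact matches of `guide` followed by an NGG PAM.

    Simplifying assumptions: forward strand only, exact matches only.
    """
    dna = dna.upper()
    guide = guide.upper()
    # A lookahead lets overlapping candidate sites be found as well.
    pattern = re.compile(f"(?={re.escape(guide)}[ACGT]GG)")
    sites = []
    for match in pattern.finditer(dna):
        pam_start = match.start() + len(guide)
        # Assumed cut point: three bases upstream of the PAM.
        sites.append(pam_start - 3)
    return sites


def cut(dna: str, position: int) -> tuple[str, str]:
    """Split the sequence into the two fragments left by a cut at `position`."""
    return dna[:position], dna[position:]


# Hypothetical 20-letter guide and a toy sequence containing one target site.
guide = "GATTACAGATTACAGATTAC"
dna = "TTT" + guide + "TGG" + "AAAA"

for site in find_cas9_sites(dna, guide):
    left, right = cut(dna, site)
    print(site, left, right)
```

Real guide design also has to score near-matches elsewhere in the genome, which is a large part of what makes the problem hard.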

In 2022, researchers at the Massachusetts Institute of Technology (MIT) paired this DNA-cutting technique with a tool called Perturb-seq to capture how ribonucleic acid, or RNA (the molecule responsible for how genes behave), responded to specific cuts. They used it to investigate over 2.5 million human cells, producing a data-rich map of gene functions.

The Human Genome Project is now one of many substantial datasets being tapped to train machine learning models, along with the haplotype map, or HapMap (which helps researchers find specific variations that impact disease), the Cancer Genome Atlas, and the Human Protein Atlas.

The Protein Folding Puzzle

Another piece of the puzzle required a 50-year scientific journey: figuring out how proteins fold. Proteins are chains built from 20 types of amino acids, whose kinetic interactions determine how the chain will fold, in milliseconds, into a particular three-dimensional shape.

Researchers knew that misfolded proteins cause neurological diseases, including Alzheimer’s and Parkinson’s, and that when misfolded proteins build up, the result can be Type 2 diabetes, inherited cataracts, or certain forms of atherosclerosis (hardening of the arteries). Protein misfolding is also associated with cancer proliferation.

But there are billions of protein sequences, and to understand how they work, we need to know their precise structure—another seemingly uncrackable code.

This protein structure problem was the earlier obsession of a number of scientists who are now working on the virtual cell, among them Dr. Mohammed AlQuraishi at Columbia University. “What really attracted me to biology was this idea that you have a code written in another language that nobody fully understands,” he says. “Protein structure, in particular, struck me as this prism through which to view all of biology.”

It could take a PhD student the entire length of his or her degree program to determine the structure of just one protein. To understand the structure of 200 million known proteins, we needed AI.

That AI tool, AlphaFold, first came from Google DeepMind in 2018, and its second iteration, AlphaFold2, followed in 2020, with the full method published in 2021. A complex program of 32 algorithms stitched together, it could accurately predict a protein’s three-dimensional structure in minutes, and its predictions now cover all 200 million known proteins. The model is now on its third iteration, with significant improvement in the accurate prediction of protein interactions, and AlphaFold2 has been used across a wide research landscape, including work on malaria vaccines and cancer treatments.

Demis Hassabis, PhD, and John Jumper, PhD, shared the Nobel Prize in Chemistry in 2024 for their work on AlphaFold, and for solving what the academy calls “the great challenge of biochemistry: the prediction problem.” Hassabis is also Founder and CEO of Isomorphic Labs, an AI drug discovery company launched in 2021 that just raised $600 million in its first external funding round.

Charting Pathways

In January 2025, in a company presentation at the J.P. Morgan Healthcare Conference (the annual gathering of healthcare companies and investors that’s become a showcase for new technology breakthroughs in the sector), Chris Gibson, PhD, the Cofounder and CEO of Recursion, shared his vision for the virtual cell. Building it, he said, requires a series of interconnected layers currently in development: patient models, pathway models, protein models, and atomistic models.

Recursion is one of a handful of TechBio companies (companies that take a data- and AI-led approach to drug discovery) making meaningful advances toward a virtual cell. Training models effectively on biological processes requires a significant amount of “clean” data, meaning data that is relatable and scalable, captured under precise conditions in an automated fashion. It’s a long slog to get there, but once you’ve built it, you can add in “messier” data layers, like real-world patient data, and see leaps of improvement in the number of insights.

For more than a decade, Recursion has been building that clean dataset. Its robot- and computer vision-equipped labs capture millions of images each week of different types of human cells under various states of perturbation (possible thanks to CRISPR-Cas9 editing), building up a 65-petabyte proprietary database of biological and chemical information that’s designed for machine learning interpretation.

Using this data, and the processing power of a massive supercomputer, the company’s researchers are actively building machine learning foundation models —the pathway models that Gibson described, able to accurately simulate the biological and chemical worlds of drug discovery—that can predict, for instance, how a protein will bind or whether a particular compound will be toxic. The company takes these models and applies them to its Recursion Operating System to drive deeper insights.

Eventually, Gibson told the audience, the company’s wet labs will no longer be producing data to build models but to validate the predictions of the virtual cell. “I believe very deeply this is what the future of our industry will be,” he said.

Leveling the Protein Modeling Playing Field

Protein modeling is now well underway following Google DeepMind’s 2018 appearance, and surprise win by a large margin, at the community-wide experiment known as the Critical Assessment of Structure Prediction (CASP).

In a related blog post titled “What just happened?” AlQuraishi called DeepMind’s win a “wake-up call for scientists,” arguing that what has held back protein prediction for academics has been secrecy and the need to “rediscover the wheel over and over.”

There’s since been a movement toward the democratization of biological tools led by AlQuraishi, Arzeda (cofounded by David Baker, Director of the Institute for Protein Design at University of Washington), and others. Founded in 2022, it’s known as the OpenFold Consortium—a nonprofit AI R&D consortium “developing free open-source software tools for biology and drug discovery.” They are pooling resources to rapidly advance the field, bringing in pharma and tech giants like Bayer and NVIDIA alongside TechBio startups and academic researchers.

And it’s working.

In 2024, the collaboration released OpenFold, its own version of AlphaFold2 that matches AlphaFold2’s accuracy, to the scientific community.

But while these models are very good at predicting protein structures, there remained a question of how proteins interact with drugs in the human body. Until now, AlphaFold and OpenFold have relied on 200,000 structures from the publicly available Protein Data Bank.

Pharma companies, meanwhile, have enormous amounts of that missing protein-drug binding data behind their digital walls. On March 27, 2025, the OpenFold Consortium announced that AbbVie and Johnson & Johnson are making that proprietary data available to further train OpenFold3, adding tens of thousands of additional protein structures. They will use a platform developed by the startup Apheris to ensure their data remains protected.

“We expect that by training on proprietary data, the model will become more capable on hard problems that AlphaFold3-based models struggle with, such as predicting small-molecule protein complexes,” said AlQuraishi in GEN.

Simulating the Atom Dance

“The physical world is quantum mechanical,” said Nobel Prize-winning physicist Richard Feynman in a keynote in 1981.

Cells and their many parts are not static. Even with more data and enhanced understanding of individual cell types and protein interactions, to simulate a cell requires bringing all of these pieces together in a way that captures how atoms in molecules behave in response to chemicals and proteins. There’s not just form and function to consider—but time.

“A cell might react to a drug perturbation within a couple of minutes—or it could be days, or weeks, or a month,” says Emma Lundberg, PhD, Associate Professor of Bioengineering and Pathology at Stanford University, co-director of the Human Protein Atlas and another of the Cell paper’s 42 authors. “I’m so used to thinking about the spatial axis and how cells are spatially organized. But the temporal dimensions are even bigger than the spatial dimensions.”

To model the molecular behavior across time and space requires computational power beyond that of supercomputers. For this, we need to enlist a quantum approach. Professor Anthony Laing of the Quantum Engineering and Technology Labs at the University of Bristol describes the atoms in molecules as being connected by springs and engaged in a “complicated dance routine.”

He continues: “At a quantum level, the energy of the dance goes up or down in well-defined levels, as if the beat of the music has moved up or down a notch. Each notch represents a quantum of vibration.”
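
The “notches” Laing describes can be written down in the simplest textbook picture, the quantum harmonic oscillator. This is an idealization assumed here for illustration, not necessarily the model used in the research described. In it, a bond vibrating at angular frequency $\omega$ can hold only the energies

```latex
E_n = \hbar\omega\left(n + \tfrac{1}{2}\right), \qquad n = 0, 1, 2, \dots

E_{n+1} - E_n = \hbar\omega
```

where $\hbar$ is the reduced Planck constant. Each step up or down the ladder adds or removes exactly one quantum of vibration, $\hbar\omega$. Real molecules deviate from this even spacing at higher energies.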

*Visualization representing a molecular quantum simulation. Image created with Flux1.1 Pro Ultra.*

In 2018, he and researchers from MIT, Nokia Bell Labs, and other institutions published a new method for simulating the quantum dynamic behavior of molecules using a photonic chip, a device that computes with photons, the individual particles of light. This reprogrammable chip captures a frame-by-frame “virtual movie” of the molecule’s quantum movements.

Classical machine learning algorithms have been compared to following a single path through a maze, while a quantum algorithm allows researchers to explore many paths —the entire dance— simultaneously. This is due to certain properties of quantum computers—which, instead of processing information sequentially as bits, use quantum bits or qubits that process many possibilities via properties like entanglement and superposition.
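
As a rough illustration of superposition, the sketch below uses plain Python to simulate a small register of qubits on an ordinary computer. It is a toy under stated assumptions, not any of the algorithms discussed in this article: the function name is hypothetical, and the only operation applied is the standard Hadamard gate, which puts each qubit into an equal mix of 0 and 1. After the gate is applied to every qubit, all 2^n bit patterns carry equal weight at once.

```python
import math


def hadamard_all(n_qubits: int) -> list[float]:
    """State vector after applying a Hadamard gate to every qubit of |00...0>."""
    state = [0.0] * (2 ** n_qubits)
    state[0] = 1.0  # start with every qubit set to 0
    h = 1 / math.sqrt(2)
    for q in range(n_qubits):
        stride = 1 << q
        new_state = state[:]
        for i in range(len(state)):
            if i & stride == 0:
                # Mix the pair of amplitudes that differ only in qubit q.
                a, b = state[i], state[i | stride]
                new_state[i] = h * (a + b)
                new_state[i | stride] = h * (a - b)
        state = new_state
    return state


state = hadamard_all(3)
probabilities = [amplitude ** 2 for amplitude in state]

for index, p in enumerate(probabilities):
    print(format(index, "03b"), round(p, 3))
```

The list of amplitudes doubles in length with every added qubit, which is why simulating large quantum systems on classical hardware becomes impractical. A measurement still returns only one bit pattern, so useful quantum algorithms depend on interference to make the right answers more likely.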

In a recent breakthrough, statisticians at the University of Georgia designed a new quantum algorithm that is able to effectively analyze RNA and protein expression from cells—identifying the most important markers from billions of possibilities coming from a single-cell phenotyping method known as CITE-seq—or Cellular Indexing of Transcriptomes and Epitopes by Sequencing.

Study author Professor Wenxuan Zhong said: “The unique characteristics of quantum algorithms make them especially well-suited to tackle complex genomic and transcriptomic problems, where the combinations and interactions of genetic markers or sequences can be vast and computationally demanding.”

In March 2024, an international team that included researchers from Boehringer Ingelheim, Palo Alto-based quantum computing software company QC Ware, Austria-based quantum computing company AQT, and German polymer company Covestro presented a hybrid quantum-classical algorithm able to simulate the electrostatic interaction between large molecules. They used a trapped-ion quantum computer that relies on charged atomic particles held in place by electromagnetic fields. The quantum device was able to yield results that were significantly better than those based on classical theory.

These are important first steps in modeling the complex behaviors of atoms and molecules and the final frontier in building a virtual cell.

“If we can predict the structure of molecules, then we can next predict how molecular machines assemble,” says AlQuraishi. “Next, we predict the motion and function of those machines, and we keep building our way up until we’ve captured the entire complexity of the cell. This would completely change how we study disease and design drugs.”

About the Author

Brita Belli is an award-winning science and tech writer and Senior Manager of Communications at Recursion, a clinical-stage TechBio company. Her writing has been featured in the New York Times, National Geographic, MSN.com, and Alternet, as well as trade publications like OnDrugDelivery, European Biopharmaceutical Review, and OR Manager Magazine. She’s the author of The Autism Puzzle: Connecting the Dots Between Environmental Toxins and Rising Autism Rates (Seven Stories Press).
