How Open Data Is Fueling the AI Drug Discovery Era

August 19, 2025

The Language of Genomes

In 2008, researchers studying glioblastoma – the most aggressive type of malignant brain tumor – sequenced 20,000 genes from glioma samples from the Cancer Genome Atlas Program. Gliomas are tumors that arise in the brain or spinal cord and can either remain stable (grade 1) or progress to deadly glioblastomas (grade 4). The researchers wanted to determine whether a genetic driver might determine which path a glioma took.

The Cancer Genome Atlas (TCGA) was launched by the National Cancer Institute and the National Human Genome Research Institute just two years prior. It went on to provide open access to genomic data from over 20,000 primary cancer samples across 33 cancer types. Before this database was available, researchers had to rely on the histology of brain tumors – the way the tissue appeared under a microscope – which gave little indication of how the disease might progress.

With this data, the researchers made a critical discovery: a mutation in the isocitrate dehydrogenase 1 (IDH1) gene. It was one of the first links found between a metabolic gene (a gene that codes for an enzyme involved in cellular metabolism) and cancer development.

Dr. Hai Yan, a professor at Duke University who co-led these studies, said that this work “not only illuminated the molecular landscape of these tumors, but also provided invaluable insights for future research and potential treatment strategies.”

Follow-up studies confirmed that IDH1 and IDH2 mutations are found in the majority (75%) of these aggressive brain tumors. Later studies found that IDH mutations also play a key role in many other cancers, including acute myeloid leukemia, thyroid cancer, prostate cancer, bone cancer, and a type of liver cancer.

This discovery has driven the development of a new class of drugs designed to inhibit IDH1 – including ivosidenib (Tibsovo®) – which have since become the standard of care for patients with these mutations.

The Search for New Drugs with AI Begins with Data

Increasingly, drug discovery begins with data. And when it comes to understanding biology and training machine learning models, the need for high-quality data runs deep. A number of academic research groups, government agencies, and even pharma and biotech companies have been building publicly available data repositories to supercharge this new era of AI-based drug discovery.

One of the most widely used data sources is the Protein Data Bank (PDB) – the very first open-access digital data resource in biology, established in 1971 with just seven protein structures. Today it contains over 200,000 3-D structures of proteins and nucleic acids and has contributed to – by one 2020 study's estimate – over 90% of new cancer drugs approved by the FDA between 2010 and 2018. The study found the same to be true across nearly all therapeutic areas. The PDB has helped researchers to design new chemical structures, to reveal potential small molecule binding sites, and to identify lead drug candidates.

AlphaFold, the AI protein prediction tool from Google DeepMind, whose database contains over 214 million structures, was trained on protein chains from the PDB. But the latest version of the tool – AlphaFold 3 – was published in Nature without open-source code, prompting an outcry from scientists – 1,000 of whom signed a letter in protest. In response, the company released AlphaFold 3 on GitHub for academic and non-commercial use only. Researchers are restricted to 10 predictions per day and cannot calculate protein structures associated with possible drugs, a restriction meant to avoid competing with DeepMind subsidiary Isomorphic Labs.

We need more access, not less, to make significant inroads in finding new disease targets and designing and accelerating new drugs.

The recent release of the open-source Boltz-2 foundation model from scientists at MIT and Recursion made both protein structure prediction and protein binding affinity prediction available to the public at unprecedented speeds for the first time.

“Binding affinity is core to developing a therapeutic start to finish and has been the fundamental issue that a lot of us have been trying to grapple with,” said Recursion’s Chief R&D Officer and Commercial Officer, Najat Khan, PhD. “The value of this collaboration is significant technological advancement geared to the purpose of application, which is drug discovery.”

Since its release in June 2025, Boltz-2 has been downloaded almost 200,000 times by over 50,000 unique users, and a number of companies – including Tamarind Bio, Rowan, deepmirror, and ReSync Bio – have added Boltz-2 to their platforms.

Expanding Public Access to Proprietary Data

While open-access tools like Boltz-2 are enabling faster discovery, more data is needed: data that accurately represents the different interconnected pathways of disease development – from genes, transcriptomes and proteomes, to tissues, organs and patients.

A lot of existing data lives in proprietary silos at pharmaceutical companies. A new initiative called the AI Structural Biology (AISB) Network is making that rich repository available for the first time. It was launched in March 2025 by Berlin-based startup Apheris along with the OpenFold Consortium – a major driver of free, open-source AI tools for biology and drug discovery. The initiative has brought in pharma heavyweights like AbbVie, Sanofi, Bristol Myers Squibb, AstraZeneca, Boehringer Ingelheim and Johnson & Johnson, who for the first time are allowing researchers to use the companies’ proprietary protein structure data to train AI models under a “federated learning” approach that preserves confidentiality and prevents proprietary information from being revealed.

With federated data networks, computations happen locally, where the data lives, and only the outputs are centrally aggregated. “Access to industry data for benchmarking is a huge value-add for everyone who builds models,” Apheris co-founder and CEO Robin Röhm told GEN.
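To make the idea concrete, here is a minimal sketch of one common federated scheme, federated averaging, written in plain Python. It is an illustration under stated assumptions, not a description of how Apheris or the AISB Network implement it: the one-parameter model, the function names, and the three toy "sites" are all hypothetical. Each site trains on data that never leaves it, and only the resulting weight and a sample count are sent for central aggregation.

```python
# Minimal federated-averaging sketch. All names and numbers are hypothetical.
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (feature x, measured value y) pairs


def local_update(weight: float, data: Dataset, lr: float = 0.01, steps: int = 20) -> float:
    """Run a few gradient-descent steps for y ≈ weight * x on one site's private data."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight: float, sites: List[Dataset]) -> float:
    """One round: each site trains locally; only weights and sample counts are shared."""
    updates = [(local_update(global_weight, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    # Central aggregation: sample-size-weighted average of the local weights.
    return sum(w * n for w, n in updates) / total


def train(sites: List[Dataset], rounds: int = 30) -> float:
    """Repeat federated rounds, starting from an untrained weight of 0."""
    weight = 0.0
    for _ in range(rounds):
        weight = federated_round(weight, sites)
    return weight


if __name__ == "__main__":
    # Toy private datasets at three hypothetical sites, all generated from y = 3x.
    sites = [
        [(1.0, 3.0), (2.0, 6.0)],
        [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
        [(2.5, 7.5)],
    ]
    print(round(train(sites), 3))
```

The aggregator in this sketch never sees a single (x, y) pair, only each site's trained weight. Real deployments typically layer further protections on top, which are outside the scope of this example.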

And some TechBio companies are making their proprietary datasets – in full or in part – open-source, in order to accelerate the broader AI drug discovery field and drive the development of new machine learning models. In January 2025, in collaboration with NVIDIA, Vevo Therapeutics made its 100-million-cell dataset – the world’s largest map of single-cell transcriptomic data – open to the public. The atlas – which includes 1,200 drug treatments across 50 different tumor models – shows how drugs interact with patient cells.

The dataset enables researchers to look at not only how drug molecules might work for one type of patient, but also how they affect the cells of hundreds of patients with different genetic variations of a disease at the single-cell level – producing millions of relevant datapoints.

“Open sourcing a dataset of this magnitude is a momentous step towards creating a more open and collaborative community in biological research, which can ultimately help us design better therapeutics for patients,” said Nima Alidoust, cofounder and CEO of Vevo, adding that it “reflects our commitment to enabling researchers worldwide to build innovative AI models.”

A similar impulse drove Recursion to release portions of its datasets to the public. The company has accumulated a 65+ petabyte proprietary biological and chemical dataset – including a massive phenomics dataset generated by millions of weekly experiments run in its automated wet labs. To date, it has released six open datasets of cells treated with a range of perturbations, including siRNA and CRISPR/Cas9 genetic knockdown/knockout, small molecules at multiple concentrations, SARS-CoV-2 virus, immune-focused proteins, and a model of COVID-19 cytokine storm. The largest of these, RxRx3, is a dataset of more than 100 TB spanning over 17,000 genes (CRISPR knockouts of most of the human genome) and 2.2 million images of HUVEC cells. It’s one of the largest public collections of cellular screening data generated from a common experimental protocol in a single lab, although it represents less than 1% of Recursion’s total dataset.

And in February 2025, the Chan Zuckerberg Initiative (CZI) launched one of the most ambitious efforts toward open-source cellular data: the Billion Cells Project. Along with 10x Genomics, Ultima Genomics, and research partners, CZI is building a single-cell dataset of 1 billion cells to train AI models. The data will be standardized and cohesive, demonstrating, for instance, how genetic perturbations manifest across diverse cell types and tissues. Collaborator Alexander Marson, Director of the Gladstone-UCSF Institute of Genomic Immunology, said the project will serve as a “functional roadmap to guide drug development, identifying targets to restore diseased cells to health.”

Global Initiatives to Deliver More Public Datasets for AI Drug Discovery

During London Tech Week in June 2025, the UK government made its own big announcement regarding public data for AI drug discovery: the formation of the OpenBind consortium. Backed by an £8 million government investment, the consortium will leverage automated chemistry and high-throughput X-ray crystallography to generate more than 500,000 protein-ligand complex structures along with their affinity measurements. This new dataset would represent a 20-fold increase over all public data produced in the last half-century.

The effort is led by some of AI drug discovery’s leading investigators, including Nobel Prize winner David Baker, Head of the Institute for Protein Design at University of Washington; Mohammed AlQuraishi, founder of OpenFold; and Paul Brennan, Chief Scientific Officer of the Oxford Drug Discovery Institute at University of Oxford.

“The task of predicting structures of molecules bound to proteins is challenged by a severe paucity of data, crucial for training data-hungry machine learning models such as OpenFold3,” says AlQuraishi, professor, Departments of Systems Biology and Computer Science at Columbia University. “The OpenBind project is poised to transform this dynamic, first by providing significant amounts of new and diverse structural data to fuel machine learning, and second by working synergistically with OpenFold to focus data acquisition on molecules and proteins with the greatest potential for improving the accuracy of predictive models.”

OpenBind will join other government-funded public databases that have been critical to AI drug discovery, including ChEMBL, a manually curated database of 2.5 million distinct compounds launched in 2009 that’s hosted by the European Bioinformatics Institute (EMBL-EBI) and is widely used to train AI models to design new druglike compounds with preferred properties.

In July 2025, the U.S. government unveiled an AI Action Plan that includes support for open models and open access to data as part of the National AI Research Resource (NAIRR) pilot. There are currently 21 datasets listed among the resources to help spur new AI-led discoveries and startups. They include the NIH Medical Imaging and Data Resource Center – an open medical imaging data repository – and the NIH National Clinical Cohort Collaborative (N3C) – a centralized clinical data repository that is the largest open-science resource of real-world data (RWD) in the U.S. Initially created to guide scientific collaboration in response to the COVID-19 pandemic, N3C has since expanded its scope to include cancer, Alzheimer’s, and other diseases. It now includes de-identified clinical data drawn from Medicare and Medicaid claims, national mortality records, imaging, viral genome sequences, and EHR data from over 80 healthcare systems.

Leveraging Public Data to Match the Right Drug with the Right Patient

Data drives decision-making at every stage of the AI drug discovery pipeline – not only in target discovery, drug repurposing, and the design of new and improved drug-like molecules, but also in identifying biomarkers (such as specific genetic mutations) that can indicate which patient populations are most likely to benefit from a new drug in clinical trials. In fact, studies have shown that “gene expression data is the most predictive genomic feature of cancer vulnerabilities and drug response.”

When scientists at Recursion were looking for the right patient populations for the company’s AI-designed cancer drug, REC-1245, they screened a large collection of cancer models from the Cancer Cell Line Encyclopedia (CCLE) – a resource from the Broad Institute of MIT and Harvard, along with Novartis, that provides open access to genomic data for nearly 1,000 cancer cell lines. They discovered a better response to the drug in patient models where normal DNA repair systems weren’t working correctly due to replication stress and DNA repair vulnerabilities (DDR defects). They then used those biomarkers to select patients – including those with MSI-high and HRR cancers – for the DAHLIA clinical trial.

A study last year used the CCLE data to screen over 750 cell lines with 51 possible combinations of cancer drugs to find biomarkers for patients who would benefit from these therapeutic combinations to improve their treatment outcomes.
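The basic logic of such a biomarker screen can be sketched in a few lines of Python. This is a simplified illustration, not the analysis used by Recursion or in the study above: the `CellLine` record, the `ddr_defect` flag, the sensitivity scores, and the function names are all hypothetical, and a real screen would add statistical testing and many more features. The sketch groups cell lines by biomarker status and compares their average drug response.

```python
# Toy biomarker stratification. Names and values are hypothetical, not CCLE data.
from statistics import mean
from typing import Dict, List, NamedTuple


class CellLine(NamedTuple):
    name: str
    ddr_defect: bool    # hypothetical biomarker flag
    sensitivity: float  # hypothetical drug-sensitivity score; higher = stronger response


def response_by_biomarker(lines: List[CellLine]) -> Dict[bool, float]:
    """Mean sensitivity for biomarker-positive (True) and biomarker-negative (False) lines."""
    groups: Dict[bool, List[float]] = {True: [], False: []}
    for line in lines:
        groups[line.ddr_defect].append(line.sensitivity)
    # Skip a group entirely if no cell line falls into it.
    return {flag: mean(values) for flag, values in groups.items() if values}


def response_gap(lines: List[CellLine]) -> float:
    """Difference in mean sensitivity: biomarker-positive minus biomarker-negative."""
    means = response_by_biomarker(lines)
    return means[True] - means[False]


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    screen = [
        CellLine("line_A", True, 0.8),
        CellLine("line_B", True, 0.6),
        CellLine("line_C", False, 0.2),
        CellLine("line_D", False, 0.4),
    ]
    print(response_by_biomarker(screen))
    print(response_gap(screen))
```

A positive gap suggests the biomarker-positive group responds more strongly, which is the kind of signal that could then be tested more rigorously before it informs patient selection.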

And in July 2025, researchers from Pfizer published the discovery of seven new breast cancer biomarkers and related therapeutic targets using the Cancer Genome Atlas.

In talking about the role of data and AI in precision oncology, Diana J. Azzam, PhD, scientific director at the Center for Advancing Personalized Cancer Treatments at Florida International University, told the ASCO Post:

“The way to advance this research is to integrate functional precision medicine, genomics, and AI to accurately identify novel biomarkers; then the goal is to match them with effective drugs and predict combinations for each individual patient based on the tumor’s genomics and DNA sequencing. The more data we can generate, the better we will be able to train AI algorithms in functional precision medicine approaches and, hopefully, truly personalize cancer care for patients in the future.”

About Brita Belli

Brita Belli is an award-winning science and tech writer and Senior Manager of Communications at Recursion, a clinical-stage TechBio company. Her writing has been featured in the New York Times, National Geographic, MSN.com, and Alternet, as well as trade publications like OnDrugDelivery, European Biopharmaceutical Review, and OR Manager Magazine. She’s the author of The Autism Puzzle: Connecting the Dots Between Environmental Toxins and Rising Autism Rates (Seven Stories Press).