ML squared: an ML tool for the design of ML models – a peek behind the paper with Luis Soenksen and Jackie Valeri

Recently, we interviewed Jackie Valeri (Massachusetts Institute of Technology (MIT), MA, USA) and Luis Soenksen (Wyss Institute for Biologically Inspired Engineering at Harvard University, MA, USA) for a peek behind their recent paper, ‘BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences.’ Jackie is a PhD candidate at MIT and Luis is a PhD graduate from MIT and current researcher at the Wyss Institute. Read on to find out more about BioAutoMATED and how this AI tool can help biologists find the optimum ML models for data analysis.

Please provide a short summary of your paper

Jackie Valeri (JV): Our manuscript and tool, BioAutoMATED, are our answer to enhance the accessibility of ML for biology. We are fortunate to be able to take advantage of this field called automated ML, which has developed algorithms to essentially help find an optimal ML model architecture and set of ML model parameters that best explain your data.

The model parameters and model architecture you use can drastically change how your data is modeled. This in turn affects the accuracy of your biological predictions and your capability to extract actionable insights from your data, including patterns that biologists might not pick up by eye.

We like to highlight that BioAutoMATED incorporates not just ML, but learning the ML. We took three open-source automated ML algorithms and combined them, because they are not often used together and there is no single best automated ML tool. Each of these three algorithms explores different types of models, and each has a different way of finding the optimal model parameters. We have also wrapped all three of these algorithms so that they are amenable to biological input and can be used easily together.

There are a couple of other things we like to highlight in terms of novelty. Firstly, the platform has an easy-to-use interface: once you have Docker set up on your computer, the package installations are automated. We know that installation is often a particular pain point for individuals with limited computational backgrounds, and even for seasoned experts. We have also automated a data ablation/model saturation test, where we look at how well the model would be trained with sequentially fewer and fewer data points. This test is particularly useful because it can tell you whether, in the future, you may not need to collect as much data. We also automate a scrambled sequence control test, which tells you whether the model is actually learning anything from the sequences you are giving it, or whether the model would perform approximately the same if you gave it scrambled sequences, in other words, sequences with the same nucleotide or amino acid frequencies but a shuffled order.
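To make the two control tests described above more concrete, here is a minimal Python sketch. It is an illustration only and not BioAutoMATED's actual code: the function names `scramble_sequence` and `ablation_subsets`, the example sequences and the chosen fractions are all assumptions made for this example. The first function shuffles each sequence so that its nucleotide or amino acid composition is preserved while the order is destroyed. The second draws progressively smaller random subsets of a dataset, which could then be used to train models and see where performance saturates.

```python
import random
from collections import Counter


def scramble_sequence(seq, rng):
    """Shuffle one sequence, keeping its character composition unchanged."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def ablation_subsets(data, fractions, rng):
    """Draw a random subset of the data for each requested fraction."""
    subsets = {}
    for frac in fractions:
        # Keep at least one example so every subset can be used for training.
        n = max(1, int(round(frac * len(data))))
        subsets[frac] = rng.sample(data, n)
    return subsets


# Hypothetical toy data, used only to demonstrate the two helpers.
rng = random.Random(0)
sequences = ["ATGCGTAC", "GGGTTTAA", "ACACACGT", "TTGACCGA"]

scrambled = [scramble_sequence(s, rng) for s in sequences]
subsets = ablation_subsets(sequences, [1.0, 0.5, 0.25], rng)

for original, shuffled in zip(sequences, scrambled):
    # Composition is identical even though the order has changed.
    print(original, shuffled, Counter(original) == Counter(shuffled))
```

If a model trained on the scrambled sequences performs about as well as one trained on the originals, the model is probably relying on composition alone rather than on sequence order.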

Another advantage that the platform has is the built-in interpretation tools. These output results that are actionable for experimentalists to see what regions or positions are important in sequences of interest. We have also integrated a design tool into our platform, which makes it easy for biologists to be able to select which sequences they might want to test next or design from scratch.

Luis Soenksen (LS): To add to that, we made the tool open source, and it has the capacity to really help biologists in analyzing pretty much any sequence that matters in biology. And the word sequence is really important here, because sequence means an array of elements that have a repeating pattern to them, like DNA, RNA, proteins and glycans (carbohydrates that populate cell membranes). These are the sequences that really matter in biology, and our platform has the ability to explore highly predictive models for all of these types of sequences.

Obviously, it still has some constraints. For example, right now our platform can only explore relatively short sequences, so it cannot currently perform anything like full-genome sequencing, but perhaps this will be possible in the future.

As biologists, I think we also recognize that people in science and engineering have time constraints. To take this into account, we have put a few parameters into our platform that users can choose from, for example, the timeframe they want to train the model in (1 day, 1 week, 1 month, etc.). These time constraints are particularly interesting, as the platform is essentially ML on top of ML: training an ML system to search for the best possible ML architecture. You can imagine that the possible number of architectures that one could explore is huge, so there are always time constraints.
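The idea of searching a large space of architectures under a user-chosen time limit can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how BioAutoMATED or its underlying AutoML tools are implemented: the function `time_budgeted_search`, the candidate grid of layer counts and kernel sizes, and the `toy_score` function (a stand-in for training and validating a real model) are all hypothetical.

```python
import itertools
import time


def time_budgeted_search(candidates, score_fn, budget_seconds):
    """Score candidates in order until the time budget is spent; return the best seen."""
    deadline = time.monotonic() + budget_seconds
    best, best_score = None, float("-inf")
    for candidate in candidates:
        if time.monotonic() >= deadline:
            break  # Out of time: stop searching and keep the best so far.
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_score(config):
    """Hypothetical stand-in for 'train a model and measure validation performance'."""
    n_layers, kernel_size = config
    return -abs(n_layers - 3) - abs(kernel_size - 5)


# Hypothetical search space: (number of layers, kernel size).
candidates = list(itertools.product([1, 2, 3, 4], [3, 5, 7]))

best, best_score = time_budgeted_search(candidates, toy_score, budget_seconds=60)
print(best, best_score)
```

With a real training step in place of `toy_score`, a longer budget simply lets the loop evaluate more candidates, which is why the choice of timeframe trades off waiting time against how much of the search space is explored.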

The last point that I want to make is that we really created this platform because we needed this tool and we wanted to find the optimum ML models quicker. We recognized that everyone was having the same issues and so we put in the time and research into solving this issue more generally.

What sparked your interest in this area/why were you inspired to write this paper?

LS: A few years ago, we were working on some challenging new synthetic biology developments at the Wyss Institute for Biologically Inspired Engineering at Harvard (MA, USA). This institution is one of the most advanced places to do this type of work in the world (if not the most advanced). We encountered a few challenges that we knew could benefit from ML, so we decided to go ahead and invest time and money in resolving those gaps; however, we realized that most other labs around us had a similar need but had a very hard time adopting these types of tools. So we decided to build a tool of our own that could also help our friends and domain experts in the institute kickstart their ML research with their own biological datasets.

JV: Like Luis said, we get approached pretty frequently, even by folks within our own lab, to help them with using ML to model a dataset. Sometimes it is a great idea for us to jump into a collaboration with them, especially if there is an overlap in our biological expertise, but sometimes it is hard for us to know what would be best for their specific data and biological question. We wanted to put the power back into the hands of the domain experts, the folks who would really know what their data means in a biological sense.

How do the results of your paper help to overcome challenges in data analysis?

LS: Well, our analysis and demonstrations confirm that bootstrapping AutoML techniques into a cohesive system, one that also provides easy-to-use tools for data preparation, model searching, model training, model interpretation and model usage for key biological sequence types, could allow a lot of domain experts in biology to more readily start incorporating state-of-the-art ML techniques into their own research.

JV: One of the big questions I frequently get asked is, “would ML even work on my data?” Normally, there is a big investment up front on the part of researchers to either learn some ML basics or partner with an ML expert to answer this question. Our BioAutoMATED platform aims to help answer this question with a lower barrier to entry, so you do not need an ML background to start building models. If you find from the results of BioAutoMATED that your dataset is sufficient to build predictive models, then a bigger investment into developing custom models or formalizing a collaboration with ML experts might be worth it. For example, you might decide to further explore variants of convolutional neural networks (CNNs) if BioAutoMATED reports the best performance for CNNs. We also show that you can use BioAutoMATED in an “end-to-end” analysis, from data analysis to validation and follow-up experiments. I think it is also important to note that ML in general, and our tool in particular, are not a panacea for problems in your data. Researchers should be skeptical of all models, so we have also included a module for testing your models on an external validation dataset to evaluate model generalizability and predictive capability.

What implications does this research have on the field of data analysis? What are some of the most interesting and useful ways that this technology could be used in healthcare?

LS: We believe the use of AutoML in biological research will radically transform the way domain experts investigate important biological questions. They can worry less about the inner workings of the statistical frameworks and the coding languages required to program and train some of these models, and instead focus all their attention on what matters most to biologists: unveiling important mechanisms and answering questions that further the understanding of their biological systems of interest. Furthermore, the use of AutoML could standardize how researchers select predictive ML models, which could address reproducibility issues in biological research.

Looking ahead, how do you see the field developing and what are the next steps in this area of research?

LS: I am really excited about even more general forms of AutoML and their use in biological research. I am also excited about ways to integrate these techniques with future trends in AI, such as large language models (LLMs) like ChatGPT, to lower the barrier to entry even further and make AI a ubiquitous tool in groundbreaking biological research.

JV: Like Luis, I am really excited about LLM-based tools. I am also optimistic about the future of automation in biology, specifically tools that can help automate not just the liquid-handling of experimental samples but decisions about the next set of experiments to perform to maximize the utility of the data.

About the authors:

Jackie Valeri is a PhD candidate within the Department of Biological Engineering at MIT (MA, USA). Prior to attending MIT, she earned her BSE and MSE in Bioengineering at the University of Pennsylvania (PA, USA). Her current research in the lab of Jim Collins at MIT centers on tackling biological problems with AI and other computational techniques. Jackie is particularly interested in democratizing access to machine learning (ML) tools for biological researchers with limited programming ability and in aiding antibiotics discovery efforts.

Luis R. Soenksen is a bioengineer and biomedical expert. Dr Soenksen has held several scientific and entrepreneurship positions at MIT’s Jameel Clinic for AI and Healthcare (MA, USA) and the Wyss Institute for Biologically Inspired Engineering at Harvard University (MA, USA), where he led the development and launch of multiple cutting-edge technologies across bioengineering, biomedicine, and ML. Dr Soenksen also holds a PhD in Mechanical Engineering from MIT, as well as a Master of Science in Bioengineering from Johns Hopkins University (MD, USA) and a bachelor’s degree in Biomedical Engineering from the Monterrey Institute of Technology (Monterrey, Mexico). Dr Soenksen is currently acting as Head of Technology at NEOX Public Benefit LLC (NY, USA), a NYC-based startup at the intersection of synthetic biology, computational design and sustainability. Dr Soenksen has made significant contributions to the fields of bioengineering, medical devices, bio-design, tissue engineering, synthetic biology and AI. Dr Soenksen’s work has been published in several high-impact journals such as Science, Nature Biotechnology, Science Translational Medicine and Cell Systems (among others). Dr Soenksen’s research has also led to several patented biomedical technologies, as well as the founding and growth of several companies internationally.

Disclaimer:

The opinions expressed in this interview are those of the interviewees and do not necessarily reflect the views of Future Medicine AI Hub or Future Science Group.