David Healey, Daniel Domingo-Fernández, J Taylor, Christoph Krettler, Rose Lightheart, Tyson Park, Tobias Kind, August Allen, Viswa Colluru
Enveda’s PRISM (Pretrained Representations Informed by Spectral Masking) model was trained on 1.2 billion small-molecule mass spectra, the largest such training set ever assembled.
Summary: 99.9% of small molecules produced in nature are unknown to science. This is because most natural molecules exist in complex mixtures, and to identify a new molecule, you have to isolate it in sufficient quantity and purity. Mass spectrometry (MS) can speed up molecule identification because it can acquire identifying information about many new molecules in a single run, while they are still in these complex mixtures. However, mass spectra are challenging to interpret, and most computational tools developed to aid their interpretation are trained only on annotated reference spectra from already-known molecules (e.g., a spectrum of caffeine paired with the corresponding structure of caffeine). This dataset represents a tiny fraction of all of life’s chemistry. At Enveda, we’ve developed a new foundation model, called PRISM, that uses self-supervised learning techniques on a massive database of unannotated spectra (i.e., spectra without corresponding molecular structures). Much as with foundation language models like GPT or BERT, we see a substantial improvement in downstream predictions of molecular identity using PRISM, which we attribute to advanced machine learning methods and an order-of-magnitude larger training set. We believe that PRISM represents a new state of the art for predicting the chemical composition of biological samples.
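To make the annotated/unannotated distinction concrete, here is a small illustrative sketch. The field names and peak values are made up for this example and are not Enveda’s actual data format; the point is simply that a reference spectrum comes paired with a molecular structure, while the vast majority of spectra do not.

```python
# Illustrative only: made-up peak values and field names, not Enveda's data format.
# An annotated reference spectrum is paired with a known molecular structure;
# an unannotated spectrum has peaks but no corresponding structure.
annotated_spectrum = {
    "peaks": [(110.07, 0.35), (138.07, 1.00), (195.09, 0.80)],  # (m/z, relative intensity)
    "structure": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",                # caffeine, as a SMILES string
}

unannotated_spectrum = {
    "peaks": [(81.07, 0.22), (145.10, 0.61), (287.12, 1.00)],
    "structure": None,  # no known molecule; most of life's chemistry looks like this
}
```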
One of the most influential advances in machine learning during the last decade has been the advent of large pretrained foundation models like Google’s BERT for language understanding or OpenAI’s GPT family for language generation. These models were trained on massive unlabeled datasets using a technique called self-supervised learning. This is in contrast to most previous methods, which relied on large databases of labeled data (e.g., thousands of pictures of cats with the label “cat”). The ability of machine learning models to make connections and identify patterns without being explicitly taught them is powerful because there is far, far more unlabeled data than labeled data in the world.
Typically, self-supervised learning includes a process of masking: hiding parts of the data and having the model predict the missing piece from the context that remains (more on this later). The ability to learn the meaning of language by ingesting raw text, without the need for explicit human annotations, is what allowed these models to scale to billions of examples and produce more accurate outputs. They are called foundation models because, once pretrained on large unlabeled datasets, they generalize to many different tasks, often after fine-tuning on smaller labeled datasets.
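To make the masking idea concrete, the snippet below is a minimal sketch of masked-peak pretraining: a toy transformer encoder is asked to reconstruct peaks that have been hidden from each spectrum. The peak encoding, model size, masking rate, and reconstruction loss here are simplifying assumptions for the example, not a description of PRISM’s actual architecture.

```python
# Minimal, illustrative sketch of masked-peak pretraining on mass spectra.
# The (m/z, intensity) peak encoding, model size, 15% masking rate, and MSE
# reconstruction loss are simplifying assumptions, not PRISM's actual design.
import torch
import torch.nn as nn

class MaskedSpectrumModel(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.peak_embed = nn.Linear(2, d_model)                # embed each (m/z, intensity) pair
        self.mask_token = nn.Parameter(torch.zeros(d_model))   # stand-in for hidden peaks
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)                      # reconstruct (m/z, intensity)

    def forward(self, peaks, mask):
        # peaks: (batch, n_peaks, 2); mask: (batch, n_peaks) bool, True = hidden
        x = self.peak_embed(peaks)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))

# One toy training step: hide ~15% of each spectrum's peaks and learn to
# reconstruct them from the peaks that remain visible.
model = MaskedSpectrumModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

peaks = torch.rand(8, 32, 2)        # a batch of 8 synthetic spectra, 32 peaks each
mask = torch.rand(8, 32) < 0.15     # randomly choose which peaks to hide
pred = model(peaks, mask)
loss = nn.functional.mse_loss(pred[mask], peaks[mask])  # scored only on the hidden peaks
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A production model would need richer peak encodings and a far larger encoder, but the core loop is the same: hide part of each spectrum, predict it from what remains, and score the prediction.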
The goal of any foundation model is to increase the accuracy of downstream predictions, and we built PRISM to further improve the models that underlie our novel drug discovery platform. In this blog post, we will discuss the datasets we used, our training methodologies, the improvements we saw after integrating PRISM into our models, and what’s next.