CANOPUS: Use Deep Learning to Predict Chemical Class from MS2 Spectra

A unique machine learning strategy allows you to annotate each spectrum in an MS2 experiment with high accuracy on the chemical class level

Markus Schmitt
Andrew Patt

Assigning metabolite identities to MS2 spectra found in untargeted metabolomics studies is challenging. We’ve previously seen how the work of Daniel Petras and the Global Natural Products Social Networking (GNPS) team is helpful for this process (check out Feature-Based Molecular Networking and Ion Identity Molecular Networking here).

Most spectral annotation approaches currently rely on molecular networking or spectral library matching. These approaches fail when they can’t find library matches for your spectra (or propagate matches from other spectra), meaning they leave a lot of your spectra unannotated.

Kai Dührkop (Sebastian Böcker's group from the Friedrich Schiller University of Jena) and Louis-Felix Nothias (University of Geneva, former member of Pieter Dorrestein's lab at UCSD / GNPS team) recently developed a completely different approach to address this issue, called CANOPUS (class assignment and ontology prediction using mass spectrometry). Rather than assigning metabolite-level identifications to tandem spectra, CANOPUS uses deep learning to predict the chemical class associated with these spectra. Because it predicts a class for every spectrum instead of waiting for a library match, it has 100% coverage, and its accuracy is high.

Although it can’t directly replace library matching approaches like FBMN, CANOPUS is a great alternative when library matches can’t be found or propagated.

In this post, we’ll first explain how CANOPUS works, and then show some real-life applications in which CANOPUS provides you with insights that other tools – such as CSI:FingerID or molecular networking – couldn’t. CANOPUS is a powerful new tool that’s perfect for increasing the number of annotated metabolites in your study, describing chemical diversity in a sample, and even for elucidating novel structures.

How to predict chemical class from spectra with CANOPUS

CANOPUS uses known annotations from spectral and chemical structure libraries to learn to make chemical class predictions on your experimental spectra. 

CANOPUS first converts your spectra to chemical fingerprints. To do this, CANOPUS uses the CSI:FingerID tool, which was also developed by the team from Jena. CSI:FingerID first takes spectra from spectral libraries and calculates their fragmentation trees. It then uses a kernel support vector machine (SVM) model to convert the fragmentation trees into probabilistic molecular fingerprints. These fingerprints are the input to the deep neural network (DNN) that powers CANOPUS.
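
To make the last step concrete, here is a minimal sketch of how a neural network can turn a probabilistic fingerprint into per-class probabilities. Everything in it is an assumption for illustration: the six-bit fingerprint, the three class names, the single hidden layer, and the random weights are made up, and this is not the actual CANOPUS architecture or its trained model. It only shows the general idea of one independent (sigmoid) output per chemical class.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . inputs + b) for each output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_params(n_in, n_hidden, n_out, seed=0):
    """Hypothetical untrained weights; a real model would learn these from library data."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "w2": matrix(n_out, n_hidden), "b2": [0.0] * n_out,
    }


def predict_classes(fingerprint, params):
    """Map a probabilistic fingerprint to one probability per chemical class."""
    hidden = dense(fingerprint, params["w1"], params["b1"], relu)
    return dense(hidden, params["w2"], params["b2"], sigmoid)


# Toy probabilistic fingerprint: each entry is the estimated probability
# that a given molecular property is present (values are invented).
fingerprint = [0.95, 0.10, 0.80, 0.02, 0.60, 0.33]
class_names = ["Flavonoids", "Isoflavonoids", "Depsipeptides"]

params = random_params(len(fingerprint), 4, len(class_names))
probabilities = predict_classes(fingerprint, params)

for name, p in zip(class_names, probabilities):
    print(f"{name}: {p:.3f}")
```

Because each output has its own sigmoid, the probabilities do not have to sum to one, which suits a setting where a compound can belong to several classes at different levels of an ontology.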

CANOPUS training workflow 
Overview of the CANOPUS workflow depicting how the model is trained on library spectra and structures.

To train the DNN, the team assembled a structural library of over four million compounds. ClassyFire superclass, class, and subclass designations were available for all four million compounds. In practice, CANOPUS can often predict class to the fifth or sixth subclass depth.

Thanks to this massive training set, CANOPUS can provide high-quality predictions for 2,497 different chemical classes. For example, CANOPUS correctly predicted the chemical class of input spectra 99.7% of the time in their test dataset. 

CANOPUS outperforms other class prediction methods

The team proved the advantages of CANOPUS over existing methods by comparing against four approaches for chemical class annotation of MS2 spectra:

  1. CSI kernel SVM: Direct chemical class prediction from spectra using CSI:FingerID (SVMs are used to predict class instead of fingerprint);
  2. MetFrag KNN-5: Search in a structural database using MetFrag and infer class from top-5 majority vote of candidate matches;
  3. CSI:FingerID KNN-5: The same as 2, but using CSI:FingerID instead of MetFrag;
  4. Spectral library KNN-5: The same as 2 and 3, but simply searching the database with no precursor ion filtration.
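
The KNN-5 baselines above share one idea: rank candidates for a spectrum, then let the top five vote on the class. A minimal sketch of that vote follows, assuming the ranking has already been produced by some search tool. The candidate names, classes, and the tie-breaking rule are hypothetical and are not taken from the CANOPUS paper.

```python
from collections import Counter


def knn_majority_class(ranked_candidates, k=5):
    """Infer a chemical class from the top-k ranked candidates by majority vote.

    ranked_candidates: list of (candidate_name, chemical_class), best match first.
    Returns None when there are no candidates, i.e. the spectrum stays unannotated.
    """
    top_k = ranked_candidates[:k]
    if not top_k:
        return None
    votes = Counter(chem_class for _, chem_class in top_k)
    # Counter.most_common keeps first-seen order among equal counts, so a tie
    # goes to the class of the higher-ranked candidate (an assumed rule).
    return votes.most_common(1)[0][0]


# Hypothetical ranked search results for one spectrum.
candidates = [
    ("candidate_a", "Isoflavonoids"),
    ("candidate_b", "Flavonoids"),
    ("candidate_c", "Isoflavonoids"),
    ("candidate_d", "Isoflavonoids"),
    ("candidate_e", "Coumarins"),
    ("candidate_f", "Flavonoids"),  # rank 6, ignored when k=5
]

print(knn_majority_class(candidates))  # Isoflavonoids
print(knn_majority_class([]))          # None
```

The empty-list case shows the weakness these baselines share: when the search returns nothing, there is nothing to vote on.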

Kai and his team ran all four methods on the same independent dataset used to measure CANOPUS's prediction accuracy, then compared them with CANOPUS using three metrics:

  • Matthews Correlation Coefficient (MCC, a single score between -1 and 1 that summarizes all four cells of the confusion matrix: true positives, true negatives, false positives, and false negatives);
  • Precision;
  • Recall.
Histograms showing the Matthews Correlation Coefficient, Precision, and Recall of each method used to predict chemical class in an independent test dataset.
CANOPUS chemical class prediction performance against four other strategies. Methods that used CSI:FingerID performed more strongly than those that didn’t. CANOPUS performed the best in terms of each metric.
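
For readers who want to see how the three metrics relate, here is a short sketch that computes them for a single chemical class from binary labels. The label vectors are invented for illustration and have nothing to do with the benchmark data. Returning 0.0 when a denominator is zero is a common convention, assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for one class from 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def precision(tp, fp):
    """Fraction of predicted members of the class that really belong to it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true members of the class that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Invented example: 1 means "spectrum belongs to the class", 0 means it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(precision(tp, fp), recall(tp, fn), mcc(tp, tn, fp, fn))  # 0.75 0.75 0.5
```

Unlike precision and recall, MCC also rewards true negatives, which matters when most spectra do not belong to a given class.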

Encouragingly, CANOPUS was the top performer according to each metric, with particularly impressive recall relative to the other four approaches. CANOPUS was demonstrably better at predicting known chemical classes than existing methods.

How does this improved performance impact the insights that you can draw from real data? The team gave examples in which CANOPUS was an indispensable component for correct annotations of experimental spectra.

CANOPUS annotates more spectra than Molecular Networking

You can use CANOPUS for many research questions that molecular networking is used for. The main advantage of CANOPUS is that it annotates spectra left out by molecular networking (typically the vast majority of spectra). To demonstrate this, the team looked at a publicly available untargeted LC-MS/MS dataset that compares various organs from Germ-Free (GF) mice with Specific Pathogen-Free (SPF) mice. 

When the team applied molecular networking to this dataset, they found 344 subnetworks in their molecular network (check this out for a refresher on molecular networking). They found that only 27% of these subnetworks had library spectra matched to any of their nodes, which would allow for identity propagation. Further, there were 376 singleton nodes without annotations in the final network, which could not have identities propagated to them. All these missed spectra are annotated by CANOPUS, even if only on the chemical class level.

Even in instances when molecular networking can provide a putative identity, CANOPUS can sometimes annotate more accurately. For example, the team focused on a subnetwork in their data containing daidzein, which had 18 other feature nodes. Besides daidzein, two other nodes were matched to libraries, both isoflavonoids. If chemical class had been propagated through this network, all nodes would have been labeled as isoflavonoids, which would have been incorrect. When they applied CANOPUS, it correctly classified nodes in this network as belonging to other chemical classes.

Example molecular network demonstrating the resolving power of CANOPUS
Molecular network containing daidzein in an untargeted LC-MS/MS mouse study. Class annotations are predicted by CANOPUS. Although compounds that were matched to libraries were all isoflavonoids, CANOPUS found several chemical classes that would be missed by a chemical class propagation strategy.

CANOPUS particularly shines when spectral libraries fall short, because its predictions are independent from the contents of spectral libraries. This means they not only offer complete coverage, but they’re also free of chemical class biases that may exist in individual libraries. This makes CANOPUS a perfect resource for examining chemical diversity.

CANOPUS illuminates chemical diversity

CANOPUS is a great tool for mapping the chemical landscape captured in your data. For example, the team wanted to compare the diversity of compounds present in some Euphorbia species from a previous study. The original study used an ensemble of class prediction techniques, including CSI:FingerID and molecular network structure propagation, and annotated only ~30% of the dataset.

In the reanalysis, the ability of CANOPUS to annotate all spectra in the dataset was crucial, and led to new discoveries. For example, in the original analysis, only one to three benzoic acid esters were annotated in each species. With CANOPUS, the team observed high variation in the numbers of benzoic acid esters detected in each species. This highlighted the benzoic acid class as an important chemical class that had appeared uninteresting in the original study.

Heatmap showing the number of features classified in six chemical classes from various Euphorbia species.
The GNPS team examined chemical variation in different Euphorbia species. They found high diversity in classes that seemed to have low diversity in an initial analysis that used molecular networking instead of CANOPUS. These diverse classes included “Benzoic acids and derivatives” and “Diterpenoids.”
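
A heatmap like the one described above boils down to counting features per chemical class in each species. The sketch below shows that tally under stated assumptions: the species labels and the list of (species, class) pairs are invented placeholders, and no particular CANOPUS output format is assumed.

```python
from collections import Counter, defaultdict


def class_counts_by_species(features):
    """Tally how many features fall in each chemical class, per species.

    features: iterable of (species, chemical_class) pairs, one per annotated feature.
    """
    counts = defaultdict(Counter)
    for species, chem_class in features:
        counts[species][chem_class] += 1
    return counts


# Invented annotations; in practice each pair would come from a class
# prediction for one feature detected in one species.
features = [
    ("species_A", "Benzoic acids and derivatives"),
    ("species_A", "Diterpenoids"),
    ("species_A", "Benzoic acids and derivatives"),
    ("species_B", "Diterpenoids"),
    ("species_B", "Diterpenoids"),
    ("species_B", "Diterpenoids"),
    ("species_B", "Benzoic acids and derivatives"),
]

counts = class_counts_by_species(features)
classes = sorted({chem_class for _, chem_class in features})

# Print one row per species and one column per class, i.e. the heatmap matrix.
for species in sorted(counts):
    row = [counts[species][chem_class] for chem_class in classes]
    print(species, row)
```

Because every feature receives a class, the counts are not limited to the features that happened to have library matches.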

Importantly, the team was able to reproduce findings from the original study as well. For instance, they showed that diterpenoid diversity varied based on the geographic location where each species was found. Overall, CANOPUS provided deeper insights than molecular networking from the same data. 

The use of CANOPUS extends beyond the examples in untargeted studies above. For example, CANOPUS outputs multiple candidate class predictions, which can be helpful for describing previously unknown compounds in your data.

CANOPUS helps with structural elucidation

Figuring out the structure of an unknown feature is a time-consuming task. It typically requires purifying the compound from a mixture and running NMR experiments. Because CANOPUS can be run on crude mixtures rather than on purified, isolated compounds, it has the potential to accelerate structural elucidation.

CANOPUS is also excellent for structural elucidation because it can predict additional information beyond chemical class, such as alternate class predictions and chemical substructure predictions. This can be critical when you analyze undescribed compounds.

In an example study, Kai and his team were trying to elucidate the structure of a metabolite extracted from a marine cyanobacterium. CANOPUS predicted that this compound would be a depsipeptide – by contrast, none of the top 20 candidates returned by CSI:FingerID predicted this. Interestingly, CANOPUS also generated alternative class predictions that corresponded to true structural features of the molecule as well as correct substructure predictions, both of which were validated using NMR experiments.

Chemical structure of a new compound with color-coded regions correctly predicted by CANOPUS.
CANOPUS correctly predicted a number of chemical moieties in a novel chemical compound of interest. b shows substructures corresponding to alternate class predictions and c shows substructures predicted with a posterior probability greater than 50%. 

The accurate suggestions for potential substructures provided by CANOPUS greatly accelerated the eventual elucidation of the structure of the new molecule. In comparison, CSI:FingerID struggled because the compound had never been identified before. CANOPUS is designed to surpass competing methods when prior knowledge is scarce.

Getting Started with CANOPUS

Incorporating CANOPUS into your research is easy. CANOPUS is freely available as a web service through the SIRIUS GUI or command line tool, which can be downloaded here (note that CANOPUS requires SIRIUS version 4.4 or higher). More information on how to use CANOPUS straight from Kai Dührkop and the team behind CANOPUS can be found at this web page, which also provides a Jupyter notebook to visualize results. Lastly, check out the original CANOPUS publication from Nature Biotechnology for additional details on how CANOPUS was built and validated.

If you have any questions on how to use CANOPUS to improve your metabolomics research, you can reach out to us as well!
