We all know the limitations of reference spectral libraries for metabolites in mass spectrometry. The space of possible metabolite structures is incomprehensibly vast, so you can’t isolate and analyze them all to create reference spectra: You have to turn to computers to predict in silico spectra for these compounds.
Predicting reference spectra isn’t easy. “We're trying to explain the physics of an explosion, and explosions aren't pretty,” says David Wishart, an influential metabolomics investigator at University of Alberta.
The Wishart lab’s CFM-ID spectral prediction and identification software is a preeminent tool tackling this problem – one of many they’ve released for metabolomics.
We spoke to David about a major recent update to CFM-ID. Like its predecessors, CFM-ID 4.0 mixes encoded chemistry knowledge and machine learning to balance prediction accuracy and runtime. But some new tricks allow CFM-ID 4.0 to achieve even better performance, especially in some under-studied metabolite and lipid classes.
In this post, we explain what drives CFM-ID 4.0’s improved accuracy and show how it compares with previous versions. We also preview future directions for CFM-ID and show you how to get started with it.
CFM-ID 4.0: An overview
There are two groups of in silico tools that identify unknown compounds:
- Tools that take spectra as input and predict compounds. For example, CSI:FingerID uses support vector machines to predict chemical fingerprints for spectra, then suggests candidate compounds that match those fingerprints;
- Tools that take molecular structures as input and predict spectra. You can match experimental spectra to in silico spectra generated by these tools to get putative identifications. Examples include LipidBlast, MetFrag, MIDAS, and MAGMa. CFM-ID falls under this category.
CFM-ID is a graph-based approach that models fragmentation events in ESI-MS/MS experiments. Nodes represent all theoretically possible fragments that can result from a parent compound; edges represent possible transitions resulting from subsequent fragmentations. Edges are weighted by transition probabilities, which are estimated using a training set of known spectra/compounds obtained from Metlin.
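To make the graph idea concrete, here is a toy sketch of a fragmentation graph with weighted edges. It is not CFM-ID's code: the node names, the probabilities, and the choice to propagate probability for a fixed number of steps are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical fragmentation graph: node -> list of (next node, transition probability).
# Outgoing probabilities of each node sum to 1; a self-loop means "no further fragmentation".
TRANSITIONS = {
    "parent": [("parent", 0.5), ("frag_A", 0.3), ("frag_B", 0.2)],
    "frag_A": [("frag_A", 0.6), ("frag_C", 0.4)],
    "frag_B": [("frag_B", 1.0)],
    "frag_C": [("frag_C", 1.0)],
}


def fragment_distribution(transitions, start, steps):
    """Propagate probability mass along the edges for a fixed number of steps."""
    state = {start: 1.0}
    for _ in range(steps):
        next_state = defaultdict(float)
        for node, mass in state.items():
            for child, prob in transitions[node]:
                next_state[child] += mass * prob
        state = dict(next_state)
    return state


# Probability of ending up at each fragment after two fragmentation steps.
dist = fragment_distribution(TRANSITIONS, "parent", steps=2)
for node, prob in sorted(dist.items()):
    print(f"{node}: {prob:.2f}")
```

In a real model, each node would carry a mass, so the final distribution over fragments can be read as relative peak intensities at the corresponding m/z values.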
CFM-ID mixes handwritten chemistry rules and machine-learned rules to predict spectra. For example, a major improvement between CFM-ID 2.0 and CFM-ID 3.0 was the inclusion of additional specialized fragmentation rules for lipids. When rules are not available to aid in spectral prediction, CFM-ID uses a pre-trained neural network model to generate its predictions.
David and the team made several upgrades in CFM-ID 4.0:
- They improved the model used to predict fragmentation probabilities. Double bond breaks are now modeled more efficiently and unambiguously than in prior versions, causing them to be predicted more accurately. See the original publication in Analytical Chemistry for more detail.
- Transition probability calculations now account for molecular topology, that is, the atoms near the bond being broken. The molecular structure is represented as an adjacency matrix, and the relevant features are converted via tensors into a feature vector that is used to learn the transition probabilities. The additional chemical context improves the predictive power of CFM-ID 4.0.
- New hand-coded rules predict not only fragmentation probabilities, but also learned relative intensities of peaks, which dramatically improves runtime and accuracy. These 313 new rules cover 11 metabolite classes with modular structures, such as acylcarnitines, acylcholines, flavonols, flavones, flavanones, and flavonoid glycosides. They can be applied to predict ESI-MS/MS QTOF spectra at three collision energies: 10, 20, and 40 eV.
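The molecular topology idea from the list above can be illustrated with a small sketch. This is an assumption-laden toy, not CFM-ID 4.0's featurization: the function name `bond_neighborhood_features`, the element list, and the "count elements at each distance from the bond" encoding are all hypothetical.

```python
from collections import deque


def bond_neighborhood_features(adjacency, symbols, bond, radius=1,
                               elements=("C", "N", "O")):
    """Count elements at each bond distance (0..radius) from the bond being broken.

    adjacency: square 0/1 matrix (list of lists) over heavy atoms.
    symbols:   element symbol of each atom, same order as the matrix.
    bond:      (i, j) indices of the two atoms joined by the bond.
    """
    i, j = bond
    distance = {i: 0, j: 0}
    queue = deque([i, j])
    # Breadth-first search outward from both atoms of the bond.
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue
        for neighbor, connected in enumerate(adjacency[atom]):
            if connected and neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    # One block of element counts per distance shell.
    features = [0] * (len(elements) * (radius + 1))
    for atom, d in distance.items():
        if symbols[atom] in elements:
            features[d * len(elements) + elements.index(symbols[atom])] += 1
    return features


# Ethanol heavy atoms: C0 - C1 - O2
ADJACENCY = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
SYMBOLS = ["C", "C", "O"]

cc_features = bond_neighborhood_features(ADJACENCY, SYMBOLS, bond=(0, 1))
co_features = bond_neighborhood_features(ADJACENCY, SYMBOLS, bond=(1, 2))
print(cc_features)  # C-C bond
print(co_features)  # C-O bond
```

The two bonds get different vectors because their surroundings differ, which is the kind of chemical context a learned transition probability can then exploit.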
To demonstrate the upgrades, David’s group gauged performance in two target tasks: predicting spectra from structures, and assigning identities to spectra.
Evaluating CFM-ID 4.0
First, David and his team compared CFM-ID 4.0’s performance in spectral prediction against its predecessor, CFM-ID 3.0. Using the Metlin training datasets, in which experimental spectra and structures are known, they employed a 10-fold cross-validation approach to assess accuracy.
They considered predicted peaks to be a match with the experimental peaks if their mass-to-charge-ratio difference was smaller than 0.01 Da or 10 ppm. They measured accuracy using four different metrics:
- Dice coefficient: twice the size of the intersection of the two peak sets, divided by the sum of the sizes of the two peak sets;
- Reweighted Stein’s dot product: a metric that measures not only overlap of peaks, but also correlation of intensities;
- Recall: proportion of experimental peaks that were matched by a predicted peak;
- Precision: proportion of predicted peaks that matched an experimental peak.
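The matching criterion and the metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code from the paper: the greedy one-to-one matching, the squared-cosine form of the dot product, and the 0.5 exponents used for reweighting are assumptions, and the example peaks are invented.

```python
def is_match(mz_a, mz_b, abs_tol=0.01, ppm_tol=10.0):
    """Peaks match if they differ by less than 0.01 Da or less than 10 ppm."""
    diff = abs(mz_a - mz_b)
    return diff < abs_tol or diff / mz_b * 1e6 < ppm_tol


def match_peaks(predicted, experimental):
    """Greedily pair each predicted peak with the closest unused experimental peak."""
    pairs, used = [], set()
    for p_idx, (p_mz, _) in enumerate(predicted):
        best = None
        for e_idx, (e_mz, _) in enumerate(experimental):
            if e_idx in used or not is_match(p_mz, e_mz):
                continue
            if best is None or abs(p_mz - e_mz) < abs(p_mz - experimental[best][0]):
                best = e_idx
        if best is not None:
            used.add(best)
            pairs.append((p_idx, best))
    return pairs


def peak_metrics(predicted, experimental):
    """Dice, precision, and recall over matched peaks (both spectra must be non-empty)."""
    n_matched = len(match_peaks(predicted, experimental))
    return {
        "dice": 2 * n_matched / (len(predicted) + len(experimental)),
        "precision": n_matched / len(predicted),
        "recall": n_matched / len(experimental),
    }


def weighted_dot_product(predicted, experimental, mz_power=0.5, intensity_power=0.5):
    """Squared cosine similarity on reweighted peaks; the exponents are assumptions."""
    def weight(peak):
        mz, intensity = peak
        return (mz ** mz_power) * (intensity ** intensity_power)

    pairs = match_peaks(predicted, experimental)
    numerator = sum(weight(predicted[p]) * weight(experimental[e]) for p, e in pairs) ** 2
    denominator = (sum(weight(p) ** 2 for p in predicted)
                   * sum(weight(e) ** 2 for e in experimental))
    return numerator / denominator if denominator else 0.0


# Invented spectra as (m/z, relative intensity) pairs.
predicted = [(100.000, 50.0), (150.005, 100.0), (200.000, 20.0)]
experimental = [(100.0005, 40.0), (150.000, 100.0)]

metrics = peak_metrics(predicted, experimental)
score = weighted_dot_product(predicted, experimental)
print(metrics, round(score, 3))
```

Here two of the three predicted peaks find an experimental partner, so recall is perfect while precision is penalized for the extra predicted peak at m/z 200.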
David and the team performed this experiment across three different collision energies (10, 20, and 40 eV) and in both ionization modes. After averaging performance across collision energies, they found that 4.0 performed substantially better than 3.0.
CFM-ID 4.0 improved every metric except recall, and the shortfall in recall was offset by major gains in precision. The team was particularly pleased with the increases in dot products, which showed that the accuracy of their peak intensity predictions was also improving.
The team then determined which major improvement (double bond breaks, molecular topology, or new rules) was responsible for the gains in performance by stripping each element from the model: the topology improvements explained 90% of the gains.
To show the value of their added fragmentation rules for unusual metabolite classes, they performed the same experiment in three new datasets containing metabolites that could be found in exposomic, foodomic, and human metabolomic studies. This time, they added a twist by testing CFM-ID 4.0’s predictive performance with and without its handwritten rulebase.
Encouragingly, they found that in the datasets with lipids, carnitines and polyphenols (the foodomic and human data), inclusion of their new rules noticeably improved predictive performance. Additionally, the rule-based version of the model runs 5-7 times faster than the machine-learning based model.
Finally, David and the team put CFM-ID 4.0 to the test at identifying spectra using a gold standard dataset (CASMI 2016) containing 208 spectra (127 in positive mode and 81 in negative mode) and the CFM-ID 3.0 spectral database. They compared its performance against several other spectral identification methods.
CFM-ID 4.0 outperformed other tools in the identification task. Intriguingly, the team noted that the CASMI 2016 data were generated on an Orbitrap platform, while CFM-ID 4.0 was trained on QTOF data. This demonstrates that CFM-ID predictions are also useful for data produced on non-QTOF instruments.
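In outline, the identification task works by scoring a query spectrum against the predicted spectrum of every candidate and ranking the candidates. The sketch below shows that idea only; the shared-peak score is a simple stand-in rather than CFM-ID's scoring function, and the candidate names and m/z values are invented.

```python
def shared_peak_score(experimental_mz, predicted_mz, tol=0.01):
    """Fraction of experimental peaks with a predicted peak within tol Da."""
    if not experimental_mz:
        return 0.0
    hits = sum(
        any(abs(e_mz - p_mz) < tol for p_mz in predicted_mz)
        for e_mz in experimental_mz
    )
    return hits / len(experimental_mz)


def rank_candidates(experimental_mz, candidate_spectra):
    """Return candidate names sorted from best to worst score (ties broken by name)."""
    scored = [
        (shared_peak_score(experimental_mz, predicted_mz), name)
        for name, predicted_mz in candidate_spectra.items()
    ]
    return [name for score, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Invented query spectrum and in silico spectra for three hypothetical candidates.
query = [85.03, 127.04, 145.05]
candidates = {
    "candidate_1": [85.03, 127.04, 145.05],
    "candidate_2": [85.03, 99.01],
    "candidate_3": [60.02],
}

ranking = rank_candidates(query, candidates)
print(ranking)
```

Benchmarks such as CASMI then ask how often the true compound lands at or near the top of this ranking.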
The future of CFM-ID
In many ways, metabolomics is still an unexplored frontier. “I think we're like astronomers,” said David. “We always want to see what's out beyond the solar system, what's beyond the galaxy and what's beyond the edge of the universe.” To meet these challenges, the CFM-ID project has more developments in store.
One promising area for David is in transfer learning: applying a model trained for one problem to another problem. Preliminary evidence shows that CFM-ID performs surprisingly well in contexts for which it has no training, like identifying the spectra of illegal narcotics. David suggests that the future of CFM-ID could lie in building specialized predictors for different classes of molecules that can still leverage transferred knowledge from other predictors.
Other future areas include improving CFM-ID’s performance on non-QTOF instruments and at higher collision energies, like those used in GC-MS experiments. Lastly, David mentions that a more refined incorporation of molecular topology into the fragmentation model could lead to even better predictions.
How to use CFM-ID 4.0
CFM-ID 4.0 is publicly available at the Wishart group's website. You can get spectral predictions at multiple collision energies for metabolites of interest, annotate peaks in a set of spectra for a known molecule, or identify compounds using your choice of candidate database, or even with a list of your own candidate IDs. Source code for the machine-learning based CFM-ID model can be found here, while source code for the rule-based CFM-ID model can be found here.
Feel free to reach out to us at Data Revenue for more information on how to use CFM-ID 4.0.