close-icon
Subscribe to learn more about this topic
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Using matchms and Omigami to Find Similar Spectra in the GNPS Dataset

How to use Cosine distance and Spec2Vec to compare spectra

datarevenue-icon
by
Markus Schmitt
Gareth Dwyer

In a previous article, GNPS Data with Python and Pandas, we downloaded the GNPS dataset as a .json file and showed you how to clean and analyze the data using Python and Pandas.

We’ll demonstrate an alternative method in this article. Instead of using the .json file, we'll use an .mgf file (also from GNPS). And instead of Pandas, we'll use the specialized Mass Spectrometry libraries matchms (for Cosine distance) and Omigami (for Spec2Vec and MS2DeepScore) to find similar spectra.

This will get you started with finding clusters of similar spectra. And you’ll be more familiar with data formats and tools built specifically for metabolomics data analysis.

[You can find a Jupyter Notebook version of this article here.]

Overview

To follow along, you should have Python and Jupyter Notebook installed. You should also be comfortable using pip or conda to install third-party Python packages. You'll need a computer with at least 8GB of RAM to comfortably load the entire GNPS dataset into memory. This is what we’ll do:

  • Download the ALL_GNPS.mgf file;
  • Look at the MGF file and see how it compares to JSON;
  • Install matchms and use it to read the MGF file;
  • Use matchms and Cosine to find similar spectra, with an example spectrum;
  • Use Omigami to find similar spectra, using the Spec2Vec and MS2DeepScore algorithms.

Downloading the ALL_GNPS.mgf file

We'll download the .mgf version of the ALL_GNPS dataset. As in the previous tutorial, we'll use wget again to view progress as the download is quite large.

If you don’t have wget, install it first with:

!pip3 install wget

Then download the file by running the following:

import wget
all_gnps_url = "https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.mgf"
wget.download(all_gnps_url)

JSON vs MGF for the GNPS dataset

The JSON file we worked with before looked like the excerpt shown below. JSON can be read and edited by humans. But it’s primarily a machine-readable format: You'll notice all the data on one line, making it hard to understand and manipulate.

An excerpt of the JSON file, with the same keys but all on a single line.

In contrast, the MGF (Mascot Generic Format) file format is primarily a human-readable format. Machines can read it, too. But it’s laid out over more lines and doesn’t use braces or other special characters. This means it’s less space-efficient but easier for humans to read and edit.

An MGF file will use more disk space than an equivalent JSON file. It’s also less interoperable: Generic libraries like pandas can read JSON files, but probably can’t read MGF files. Here is an example excerpt of an MGF file below:

 An excerpt of an MGF file showing keys like PEPMASS, CHARGE, and SMILES.

Each piece of metadata has its own line, with the name, followed by an equals sign, the data, and a newline character. Each spectrum is defined in a block between a BEGIN IONS and END IONS statement. And a blank line separates each block.

Installing matchms and parsing the MGF file

You can install the matchms library using pip by running the following:

!pip install matchms

Built into matchms is a load_from_mgf function. This makes it easy to read our ALL_GNPS file into memory, as follows:

from matchms.importing import load_from_mgf
spectra = load_from_mgf("ALL_GNPS.mgf")

spectra

# >>> 

You'll notice that this code executes very quickly. It hasn't actually read the entire file into memory (RAM) yet. Instead it’s created a Python object: This acts like a Python list, but doesn't actually contain any data. It simply knows where to get the data when it’s needed.

A generator is very efficient. But it can also be inconvenient. You can iterate through it like you would a list. But you can’t, for example, access the third spectrum using the syntax specs[2]. For convenience, you can force it to load everything into memory by converting it into a list as follows:

%%time
spectra = list(spectra)
# >>> CPU times: user 1min 20s, sys: 831 ms, total: 1min 21s
# >>> Wall time: 1min 21s

Because data is now being transferred off your hard disk and into RAM, this takes a couple of minutes. You can check your total number of records by looking at the length of the list:

len(spectra)
# >>> 480902

Understanding the matchms Spectrum object

Each item in the list is a Spectrum object. This is something the matchms library provides. On their own, these objects don't look like much. But every Spectrum object has a .metadata property which contains all the metadata for that spectrum. Run the following code to see this:

spectra[0]
# >>> 
spectra[0].metadata

# output
{'pepmass': (981.54, None),
 'charge': [0],
 'mslevel': '2',
 'source_instrument': 'LC-ESI-qTof',
 'filename': '130618_Ger_Jenia_WT-3-Des-MCLR_MH981.4-qb.1.1..mgf',
 'seq': '*..*',
 'ionmode': 'Positive',
 'organism': 'GNPS-LIBRARY',
 'name': '3-Des-Microcystein_LR M+H',
 'pi': 'Gerwick',
 'datacollector': 'Jenia',
 'smiles': 'CC(C)CC1NC(=O)C(C)NC(=O)C(=C)N(C)C(=O)CCC(NC(=O)C(C)C(NC(=O)C(CCCNC(N)=N)NC(=O)C(C)C(NC1=O)C(O)=O)\\C=C\\C(\\C)=C\\C(C)C(O)Cc1ccccc1)C(O)=O',
 'inchi': 'N/A',
 'inchiaux': 'N/A',
 'pubmed': 'N/A',
 'submituser': 'mwang87',
 'libraryquality': '1',
 'spectrumid': 'CCMSLIB00000001547',
 'scans': '1'}

This output is a Python dictionary containing all the metadata for the first spectrum. When we previously used the JSON file, we noticed that the data for the ionmode attribute was not normalized. And there were several variations. Take a look at this same attribute using matchms as follows:

from collections import Counter
Counter([spectrum.metadata['ionmode'] for spectrum in spectra])

This will output all the values across the whole dataset, along with their frequency counts:

Counter({'Positive': 362212,
         'Negative': 66130,
         'positive': 36163,
         'negative': 14734,
         'Positive-20eV': 2,
         'N/A': 377,
         '': 1284})

You'll see that matchms doesn't do much automatic normalization: There are still inconsistencies in casing (Positive vs positive). But the extra spaces present in the JSON version aren’t seen here.

To update a piece of metadata, you have to overwrite the entire .metadata dictionary. We're not going to walk through cleaning up the dataset again. But here's an example of how to change the Positive label for the first spectrum in the dataset to positive:

print(spectra[0].metadata['ionmode'])
# >>> Positive
# create a temporary dictionary to hold the updated metadata
spectrum_metadata = spectra[0].metadata
spectrum_metadata['ionmode'] = 'positive'

# assign the new dictionary to the spectrum object
spectra[0].metadata = spectrum_metadata
spectra[0].metadata

Now in the output you can see that the ionmode value is lowercase:

{'pepmass': (981.54, None),
 'charge': [0],
 'mslevel': '2',
 'source_instrument': 'LC-ESI-qTof',
 'filename': '130618_Ger_Jenia_WT-3-Des-MCLR_MH981.4-qb.1.1..mgf',
 'seq': '*..*',
 'ionmode': 'positive',
 'organism': 'GNPS-LIBRARY',
 'name': '3-Des-Microcystein_LR M+H',
 'pi': 'Gerwick',
 'datacollector': 'Jenia',
 'smiles': 'CC(C)CC1NC(=O)C(C)NC(=O)C(=C)N(C)C(=O)CCC(NC(=O)C(C)C(NC(=O)C(CCCNC(N)=N)NC(=O)C(C)C(NC1=O)C(O)=O)\\C=C\\C(\\C)=C\\C(C)C(O)Cc1ccccc1)C(O)=O',
 'inchi': 'N/A',
 'inchiaux': 'N/A',
 'pubmed': 'N/A',
 'submituser': 'mwang87',
 'libraryquality': '1',
 'spectrumid': 'CCMSLIB00000001547',
 'scans': '1'}

Now you’re more familiar with matchms and MGF files, let’s walk through using matchms to find clusters of similar spectra.

Finding similar spectra using matchms

Stenothricin spectra form an interesting gene cluster. Matchms includes built-in functionality to find the similarity between given spectra by looking at the cosine distance between the peak data.

You can use Stenothricin C, ID CCMSLIB00000075068, as a starting point. It exists in the GNPS dataset. To pull it out, you can loop through the specs variable until you find it as follows:

stenothricin_c_id = "CCMSLIB00000075068"
stenothricin_c = None

for spectrum in spectra:
   if spectrum.metadata['spectrumid'] == stenothricin_c_id:
       stenothricin_c = spectrum
       break

Now you have the spectrum of interest in your stenothricin_c variable.

To calculate the cosine distance (which represents similarity) to all the other spectra in the dataset, you can use the matchms calculate_scores function with the included CosineGreedy similarity function. Run the following: 

%%time
import matchms
similarity_scores = matchms.calculate_scores(
   references=spectra,
   queries=[stenothricin_c],
   similarity_function=matchms.similarity.CosineGreedy()
)

The queries argument we pass in is a list because you can ask for the similarity of several query spectra at once. In this example, we're just using stenothricin_c. So you'll pass in a list with only a single item. It should take one minute to calculate the similarity scores for all other spectra in the dataset (around 500,000). You can find the results in the returned scores attribute. Take a look at its shape (rows and columns), as follows:

similarity_scores.scores.shape
# >>> (480902, 1)

This score’s object is an ndarray (multidimensional array). It returns a 2D matrix with a column per query we submitted. In our case, our matrix only has one column as we only passed in a single spectrum as a query.

This has calculated the similarity score between our spectrum and every other spectra in the database. But we’re only interested in the scores that have a high similarity rating. We can find the indices of the best matches using numpy argpartition as follows:

import numpy as np

def get_top_matches(xs, n=5):
   return np.argpartition(xs, -n)[-n:]


top_match_indexes = get_top_matches(similarity_scores.scores[:, 0], 10)
top_match_indexes
# >>> array([222615, 222831, 221828, 222626, 345985, 222899, 222891,   1002, 222883, 1001])

This shows us the indices of our best matches in the scores array. We can use this to look up the actual scores in the scores array. And we can get the data about the matching spectra from specs. Run the following:

for mi in top_match_indexes[::-1]:
   query_name = stenothricin_c.metadata['name']
   match_name = spectra[mi].metadata['name']
   match_id = spectra[mi].metadata['spectrumid']
   sim = round(similarity_scores.scores[mi][0][0], 2)
   print(f'{query_name} has a {sim} cosine match with {match_name} ({match_id})')

This should show you the top ten matches:

Stenothricin C M+H has a 1.0 cosine match with Stenothricin C M+H (CCMSLIB00000075068)
Stenothricin C M+H has a 0.76 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456969)
Stenothricin C M+H has a 0.64 cosine match with Stenothricin G M+H (CCMSLIB00000075069)
Stenothricin C M+H has a 0.54 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456977)
Stenothricin C M+H has a 0.47 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456985)
Stenothricin C M+H has a 0.47 cosine match with 20(S)-Ginsenoside F2 [M+Na]+ (CCMSLIB00006580075)
Stenothricin C M+H has a 0.46 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456712)
Stenothricin C M+H has a 0.44 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006455914)
Stenothricin C M+H has a 0.43 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456917)
Stenothricin C M+H has a 0.42 cosine match with 14-hydroxysprengerinin C [M+Na]+ (CCMSLIB00006456701)

In the for loop, we use [::-1] to reverse the list as the more similar spectra are towards the end. We print out the similarity between our “query” spectrum (Stenothricin C) and each of the top ten matches. Because we used an example from the dataset as a query spectrum, we get one perfect match (with itself) and then some other close matches with Hydroxysprengerinin C and Stenothricin G.

If we happen to know the spectrumids of some other spectra in the Stenothricin cluster, we can find the similarity to each as follows:

other_stenothricin = {'CCMSLIB00000075069': 'Stenothricin G',
'CCMSLIB00000075070': 'Stenothricin A',
'CCMSLIB00000075071': 'Stenothricin D',
'CCMSLIB00000075072': 'Stenothricin B',
'CCMSLIB00000075073': 'Stenothricin E',
'CCMSLIB00000075076': 'Stenothricin H',
'CCMSLIB00000075077': 'Stenothricin I'
}

steno_indicies = []

for i, spectrum in enumerate(spectra):
   if spectrum.metadata['spectrumid'] in other_stenothricin:
       steno_indicies.append(i)
      
for si in steno_indicies:
   query_name = stenothricin_c.metadata['name']
   match_name = spectra[si].metadata['name']
   match_id = spectra[si].metadata['spectrumid']
   sim = round(similarity_scores.scores[si][0][0], 2)
   print(f'{query_name} has a {sim} cosine match with {match_name} ({match_id})')

And you should see the similarity to each of these:

Stenothricin C M+H has a 0.64 cosine match with Stenothricin G M+H (CCMSLIB00000075069)
Stenothricin C M+H has a 0.03 cosine match with Stenothricin A M+H (CCMSLIB00000075070)
Stenothricin C M+H has a 0.1 cosine match with Stenothricin D M+H (CCMSLIB00000075071)
Stenothricin C M+H has a 0.04 cosine match with Stenothricin B M+H (CCMSLIB00000075072)
Stenothricin C M+H has a 0.05 cosine match with Stenothricin E M+H (CCMSLIB00000075073)
Stenothricin C M+H has a 0.04 cosine match with Stenothricin H M+H (CCMSLIB00000075076)
Stenothricin C M+H has a 0.02 cosine match with Stenothricin I M+H (CCMSLIB00000075077)

These other spectra are part of the Stenothricin cluster. But interestingly, most of their cosine similarity scores are low.

Using Omigami to compute Spec2Vec similarity scores

"Similarity" is often thought of as a fixed metric. In reality, there are many different ways for things to be similar to – or different from – each other. Cosine distance is one way to measure how similar to one another spectra are. But there are many other ways to do this.

Spec2Vec, which we've written about in detail, is often a better measure than Cosine distance. But it's also harder to calculate.

Luckily, Omigami makes it easy. Head over to Omigami.com and sign up for an account. You'll need this to use their API.

Once you've signed up at omigami.com, head over to the Omigami GitHub and follow the installation and configuration instructions.

Now you can submit your spectra to Omigami directly. This skips the need to download the GNPS dataset at all. Omigami works directly with MGF files, so first save our query spectrum to a new file as follows:

import os
from matchms.exporting import save_as_mgf

if os.path.exists("query.mgf"):
   os.remove("query.mgf")
save_as_mgf(stenothricin_c, "query.mgf")

This will create a file called "query.mgf" with a single spectrum in it: the Stenothricin C sample we used above. Now you can submit that query to Omigami as follows:

from omigami import Spec2Vec

def get_spec2vec_similarities():
   client = Spec2Vec()

   n_best_matches = 5
   include_metadata = ["Compound_name"]
   ion_mode = "positive"  # either positive or negative

   result = client.match_spectra_from_path(
       "query.mgf", n_best_matches, include_metadata, ion_mode=ion_mode,
   )

   matches = result[0] # we only submit a single sample so there is only one result
   return matches

matches = get_spec2vec_similarities()
print(matches)

The object you get back from Omigami is already nicely formatted. So you don’t have to do custom formatting as you did for the Cosine similarity scores. Your output should look like this below:


                         score    compound_name
matches of spectrum-0                           
CCMSLIB00000075068            1   Stenothricin C
CCMSLIB00000075071     0.701894   Stenothricin D
CCMSLIB00006552980     0.531979  Ginsenoside Rb1
CCMSLIB00006552726      0.52005  Ginsenoside Rb1
CCMSLIB00006552745     0.501387  Ginsenoside Rb1

As with the Cosine score, we see Stenothricin C as the top match: It's exactly the same sample so we expected this. Unlike before, in second place we now have Stenothricin D with a match of 70% – using Cosine distance gave us only a 10% match to this spectra.

Stenothricin G does not appear in Spec2Vec’s top matches though. This shows how tricky similarity can be. 

You can request more matches from Omigami by changing the value of the n_best_matches variable in the example code above. And if you want more metadata than just the name of matches returned, you can add more items to the include_metadata list.

Trying out MS2DeepScore with Omigami

If you want to try out different similarity algorithms, you can simply switch out the import and client initialization. For example, to try MS2DeepScore instead of Spec2Vec, you'd only need to make two changes to the code.

The rest of the code remains identical as Omigami provides a common interface to different algorithms.

from omigami import MS2DeepScore ## Import the MS2DeepScore client instead of Spec2Vec

def get_spec2vec_similarities():
	client = MS2DeepScore()  ## Change how you initialise the client

	n_best_matches = 5
	include_metadata = ["Compound_name"]
	ion_mode = "positive"  # either positive or negative

	result = client.match_spectra_from_path(
    	"query.mgf", n_best_matches, include_metadata, ion_mode=ion_mode,
	)

	matches = result[0] # we only submit a single sample so there is only one result
	return matches

matches = get_spec2vec_similarities()
print(matches)

You should see something like this:


                         score    compound_name
matches of spectrum-0                           
CCMSLIB00000075068            1   Stenothricin C
CCMSLIB00006553117     0.859747  Ginsenoside Rb1
CCMSLIB00006553135     0.852566  Ginsenoside Rb1
CCMSLIB00006553081     0.845281  Ginsenoside Rb1
CCMSLIB00006553099     0.844628  Ginsenoside Rb1

Note that in both examples we see the same spectrum (Ginsenoside Rb1) returned several times, with slightly different similarities. In a real use-case, it would be important to do some more cleaning and duplicate removal. This prevents a single match from clogging up the returned results. You can adjust the `n_best_matches` variable to get more results and then deduplicate based on the spectrum name.

MS2DeepScore is what we recommend for most real-world use cases. The scores it returns are not directly comparable to Spec2Vec or the Cosine scores as MS2DeepScore is based on tanimoto similarity scores. For more information, take a look at our MS2DeepScore article and our Spec2Vec article.

Where next?

You've got a good grasp of how to work with GNPS data using matchms and you’ve seen the basics of how Omigami works.

For more advanced analysis, take a look at the Omigami documentation. We’re constantly testing and adding the latest metabolomics algorithms, through the same easy and scalable API.

Get Notified of New Articles

Leave your email to get our weekly newsletter.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.