In this article we point out how feature selections can be biased, what a minimally biased feature selection should look like, and how you can easily implement it in R or Python.
This article is based on an excellent interview with Carl Brunius, author of the MUVR package in R, who shared his knowledge and inspired us to implement the same approach in a Python package called Omigami.
When using untargeted metabolomics, our goal is often to identify biomarker candidates – mass spec peaks that have consistently different abundances between e.g. two groups of individuals, e.g. healthy vs diseased or treatment A vs B, or maybe in relation to an environmental exposure on a continuous scale, such as diet.
Once we’ve identified interesting biomarker candidates, the next step is to validate them – to show that they can successfully be used as biomarkers.
Validating the biomarker candidates can take months and cost millions. So we need to be sure the candidates we select will also work on new individuals and didn’t just arise from bias or chance. If none of our candidate compounds makes it through the validation stage, then the entire process was all for nothing.
Given how important it is that we select truly biologically relevant features, it’s surprising how often too little thought goes into testing the robustness of a feature selection technique.
Omics datasets are unique
Compared to datasets in other areas, omics datasets are unique in several ways. They generally have: Many features (often over 10,000) and few samples (20–3,000).
Because we have many features and few samples, running standard univariate analyses and using naive statistical tests to find the interesting variables often leads to bad results and radical overfitting: When you have thousands of variables, some variables will inevitably look like strong predictors – simply due to chance. And when you want to validate them on new research subjects, they suddenly won’t work anymore.
Techniques to ensure accurate, reliable pipelines
Carl highlighted several techniques to help you make your feature selection on untargeted metabolomic – or other omics – data more accurate and reliable. Specifically, the key ingredients are:
- Double cross-validation;
- Finding relevant features with multivariate models;
- Recursive feature elimination; and
- Using permutation tests to evaluate your technique and establish baseline metrics.
We’ll also take a look at the MUVR package for R and the Omigami package for Python, which implement all of the above ideas in easy-to-use packages.
A little background on bias and overfitting
Every model you fit to a dataset is in danger of learning patterns that don’t generalize to new data. Any kind of bias, leakage, or random patterns in your training data can cause your model to overfit.
Carl reminded us that what’s often overlooked is that any type of “improvement” you make to the data – any cleaning, pre-processing, imputation, filtering, correction, or normalization – will also introduce some bias. No change to the data is risk-free.
Even researchers with a solid foundation in statistics often make mistakes based on false intuition, also in omics research.
The multiple hypothesis problem & why not to use univariate tests for feature selection
A lot of scientific research revolves around some kind of p-value. You have a hypothesis, you do an experiment, and you check how likely it is that the results you see could have been produced by chance.
With so much computation power at our fingertips, it’s tempting to test every single variable, treating each compound as a relevant, independent hypothesis. The problem is that when you run 10,000 univariate tests – each testing a different “hypothesis” – you’re going to find patterns that look very convincing, even in truly random data.
It’s standard procedure to protect against such spurious findings by doing some kind of statistical adjustment for multiple testing (like Bonferroni or false discovery rate adjustment). However Carl pointed out several problems with this approach:
- False Negatives: If we adjust for the total number of measured features (in the thousands), we run the risk of getting few or no significant variables at all and getting false negatives instead: Even for truly biologically informative variables, the null hypothesis may not be rejected due to overadjustment.
- Non-relevance: Statistical corrections assume that each of our features, and thus each of our tests, is in fact a relevant hypothesis. But in reality, if you were to sit down and think about whether each of these variables is in fact an interesting hypothesis to test, you’d simply never run many of them.
- Non-independence: Next, these tests assume that the tested hypotheses are independent. If there's one thing we know about all kinds of biological omics, it's that variables are not independent.
So our features are neither all relevant, nor independent - and this makes it really unclear how many variables we need to adjust for - but it’s certainly much fewer than all of them. And even the adjustments don’t fully protect against false positives.
Our experience is that a much more compelling approach is to use multivariate analysis as a tool to identify an interesting, relevant subset of variables. Then we can model the resulting – drastically reduced – list of variables one by one and make adjustments for multiple testing on this set. But there are potential problems in this approach as well. Especially related to false positive discovery due to the greediness of multivariate models.
How your models might look a lot better than they are, despite cross-validation
To improve multivariate modelling performance, the following process is common: 1) perform an initial model, 2) select the subset of most predictive features from that model (e.g., 20 out of 10,000) , and then 3) remodel using these features to build new model with increased predictive performance.
But this technique carries a massive risk of overfitting. Your model might look like it’s performing well, when in fact the selected features have already been fit to the testing data. When you give these features to a model, it will then learn the same patterns, and they will fit the test sets “surprisingly well.” So your model looks much better than it actually is, because a lot of information leaked in via feature selection.
Carl has shown previously, using permutation tests, that standard cross-validation is not sufficient to avoid this kind of bias. And the conclusion is that you need more elaborate validation strategies.
A two-part solution for reliable feature selection on untargeted metabolomics data
With the small number of samples normally observed in metabolomics and the many sources of measurement bias – without even speaking of the variance of the biology that we measure – it’s always hard to achieve results you can be confident in. That said, Carl Brunius and his colleagues have developed a cross-validation framework that can identify the most informative features-of-interest while still reducing the likelihood of overfitting that can arise from working within a single dataset. This framework is built upon the integration of repeated double cross-validation with recursive feature elimination.
Part 1: Recursive feature elimination
Multivariate modelling with Recursive Variable Elimination (RVE) involves:
- Training and optimizing multiple multivariate models;
- Averaging and ranking the variable importances over all the models;
- Removing the fraction of variables with the lowest average importance.
This results in a better set of results than running multiple univariate tests, and it also handles co-correlated variables well. For example, a metabolite which is a byproduct of another but provides a weaker signal will eventually be eliminated, leaving only the one with the stronger signal.
All-relevant and minimal-optimal strategies for variable selection
Variable selection techniques can be loosely grouped into minimal-optimal and all-relevant.
Carl suggests to use a minimal-optimal strategy when the goal is to find the smallest set of variables that can optimally predict the outcome. This can be useful, for example in identifying potential diagnostic biomarkers for a specific disease or exposure.
By contrast, an all-relevant strategy aims to find every variable that is related to the target, even if it’s a weak or redundant variable. This can be more useful, for example if you want to understand biochemical systems and uncover the mechanisms of metabolic processes.
Part 2: Repeated double cross-validation
Cross-validation – splitting our dataset into multiple sets of train and test data, and then optimizing our models so they perform well on all test sets – is a very effective technique to avoid overfitting.
However, if you want to both select features, find optimal parameter settings for the models we train, and avoid overfitting to your observations, then a standard, single cross-validation is usually not good enough.
As mentioned previously, if you carry out variable selection based on the same samples you used to estimate your prediction error, you’ll get selection bias even with cross-validation. As a result your model will overfit and you will drastically underestimate it’s true error rate - and your variable selection is likely to suffer from false positives.
One solution is to nest cross-validation into two loops:
- An inner cross-validation to select features and tune model parameters;
- An outer cross-validation to evaluate the models that are trained on the entire inner data.
Each loop is run multiple times, and each time the folds are shuffled to maximize variability. Then the selected features and parameter settings are averaged over all runs.
Predicting with multiple models rather than one consensus model
What’s often overlooked is that after we’ve cross-validated, we truly only know how our sub-models perform when they’re trained on their respective train sets and tested on their respective test sets.
At this point, you might be tempted to simply take the average parameter settings and train a model on the entire dataset – but how do we know this actually works? Carl mentions in another paper, that there’s no strong theoretical basis for that conclusion. In fact, even if the variables and model parameters, such as number of components in PLS, are selected and tuned correctly, there is still overfitting to the observations to be considered.
Instead, he proposes to keep all of our sub-models, and when we make predictions on unseen samples, we can simply use predictions from all of these models together.
And as an added benefit, we retain variability in our predictions to represent some of the underlying biological variability.
In addition to calculating a mean for these predictions, we can also visualize multiple predictions and get a sense for how trustworthy each prediction is. There are two main cases:
- The prediction values are clustered together: we can be more confident in both the predicted value and the mean.
- The prediction values are widely spread: the models agree less, and we should be more careful with the prediction.
Establishing a baseline with permutation tests
Standard T-tests assume a random baseline, but in omics research, this is sometimes a false assumption. Ask yourself: Could a random dataset produce a similar result? If the answer is yes, your model is probably not as good as you think.
In these cases, we need to adjust our baseline assumptions up. An elegant solution to this problem is permutation testing.
Permutation tests involve running the data through our modelling and feature selection pipeline with random target variables (cohort labels). If we repeat the permutation test several times, we’ll get a distribution of accuracies under random conditions (i.e. the null hypothesis). A p-value for the accuracy of the model can then be obtained by comparing the actual model to the obtained null hypothesis distribution.
In addition, this null hypothesis distribution also allows us to objectively compare different experiment pipelines: Pipelines where the null hypothesis distribution corresponds to the expected number of misclassifications have little overfitting and bias. Conversely, pipelines with higher accuracy on random data indicate a strong tendency to overfit and should probably be avoided.
How to use this technique right away
You can easily run your own untargeted metabolomics experiments with the MUVR package (in R) or the Omigami package (in Python). Both packages implement the techniques discussed in this article, and they allow you to integrate multivariate modelling with minimally biased variable selection within a repeated double cross-validation framework.
Getting started with Omigami is as easy as pip install omigami.
Get help or advice
If you’re doing untargeted metabolomics, we’d love to hear about the challenges you’re working on.