How to fight COVID-19 with machine learning

Introduction

Viral pandemics are a serious threat. COVID-19 is not the first, and it won't be the last.

But, like never before, we are collecting and sharing what we learn about the virus. Hundreds of research teams around the world are combining their efforts to collect data and develop solutions.

We want to shine a light on their work and show how machine learning is helping us to:

Identify who is most at risk,
Diagnose patients,
Develop drugs faster,
Find existing drugs that can help,
Predict the spread of the disease,
Understand viruses better,
Map where viruses come from, and
Predict the next pandemic.

Let’s promote the research to fight this pandemic – and prepare ourselves better for the next one.

1. Identifying who is most at risk from COVID-19

Machine learning has proven to be invaluable in predicting risks in many spheres.

With medical risk specifically, machine learning is potentially interesting in three key ways.

Infection risk: What is the risk of a specific individual or group getting COVID-19?

Severity risk: What is the risk of a specific individual or group developing severe COVID-19 symptoms or complications that would require hospitalization or intensive care?

Outcome risk: What is the risk that a specific treatment will be ineffective for a certain individual or group, and how likely are they to die?

Machine learning can potentially help predict all three risks. Although it’s still too early for much COVID-19-specific machine learning research to have been conducted and published, early experiments are promising. Furthermore, we can look at how machine learning is used in related areas and imagine how it could help with risk prediction for COVID-19.

1.1 Predicting the risk of infection

Early statistics show that important risk factors that determine how likely an individual is to contract COVID-19 include:

Age,
Pre-existing conditions,
General hygiene habits,
Social habits,
Number of human interactions,
Frequency of interactions,
Location and climate,
Socio-economic status.

Risk research for the current pandemic is still in the early stages. For example, DeCaprio et al. have used machine learning to build an initial Vulnerability Index for COVID-19. Prevention measures such as wearing masks, washing hands, and social distancing are all likely to influence overall risk as well. As more and better data becomes available and ongoing studies produce results, we will likely see more practical applications of machine learning for predicting infection risk.
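To make the idea of a learned risk score more concrete, here is a minimal sketch of a logistic regression model trained with plain gradient descent. It is not the method behind the Vulnerability Index mentioned above. The three features (scaled age, a pre-existing condition flag, and scaled daily contacts), the function names, and every record are made up purely for illustration.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_risk(weights, bias, x):
    """Return the modelled probability of infection for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Synthetic, invented records: (age / 100, has pre-existing condition, daily contacts / 50)
X = [
    (0.25, 0, 0.1), (0.30, 0, 0.2), (0.45, 0, 0.1), (0.35, 1, 0.1),
    (0.70, 1, 0.6), (0.80, 1, 0.4), (0.65, 0, 0.8), (0.75, 1, 0.9),
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = infected in this toy dataset

weights, bias = train_logistic(X, y)
low = predict_risk(weights, bias, (0.25, 0, 0.1))
high = predict_risk(weights, bias, (0.80, 1, 0.8))
print(f"low-risk profile: {low:.2f}, high-risk profile: {high:.2f}")
```

A real model would need far more data, careful validation, and the additional factors listed above, such as hygiene habits, location, and socio-economic status.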

1.2 Predicting who is at risk of developing a severe case

Once a person or group has become infected, we need to predict the risk of that person or group developing complications or requiring advanced medical care. Many people experience only mild symptoms, while others develop severe lung disease or acute respiratory distress syndrome (ARDS), which is potentially deadly. It’s not possible to treat and closely monitor everyone with mild symptoms, but it’s far better to start treatment early if more severe symptoms are likely to develop.

In the Computers, Materials and Continua journal, researchers published an article showing that machine learning could potentially predict the likelihood of a patient developing ARDS as well as the risk of mortality, just by looking at the initial symptoms. The researchers acknowledge the limitations of this research:

“A clear limitation of this study is the size of the dataset; 53 patients with some incomplete data as well as a limited spectrum of severity.”

But the study lays important groundwork for applying machine learning once more data becomes available.

1.3 Predicting treatment outcomes

An extension of severity prediction is predicting the treatment’s outcome, which is often literally a matter of predicting life and death. Clearly, it would be useful to know how likely a patient is to survive, given certain symptoms. But on top of this, it’s important to keep in mind that not all patients are treated in the same way. Given a specific patient or group, how effective is a specific treatment likely to be?

If we can predict the outcomes of specific treatment methods, then doctors can treat patients more effectively. Using machine learning to personalize treatment plans is not specific to COVID-19, and machine learning has previously been used to predict treatment outcomes for patients with epilepsy, as just one example. Researchers have also used machine learning to predict responses to cancer immunotherapy.

Because treatment options for COVID-19 are still evolving, it will likely be some time before we see machine learning applied to predicting outcomes for specific treatments. But outcome prediction remains an important part of risk assessment, working hand-in-hand with the infection and severity predictions we discussed above.

2. Screening patients and diagnosing COVID-19

When a new pandemic hits, diagnosing individuals is challenging. Testing on a large scale is difficult and tests are likely to be expensive, especially in the beginning. Anyone who has any symptoms of COVID-19 is likely to be very concerned that they have contracted the disease, even if the same symptoms are indicative of many other, potentially milder diseases too.

Instead of taking medical samples from each patient and waiting for slow, expensive lab reports to come back, a simpler, faster, and cheaper test (even if it’s less accurate) would be useful in gathering data on a larger scale. This data could be used for further research, as well as for screening and triaging patients.

When it comes to using machine learning to help diagnose COVID-19, promising research areas include:

Using face scans to identify symptoms, such as whether or not the patient has a fever,
Using wearable technology such as smart watches to look for tell-tale patterns in a patient’s resting heart rate,
Using machine learning-powered chatbots to screen patients based on self-reported symptoms.

2.1 Screening patients using face scans

Although there are few precise details available, a hospital in Florida was one of the first to attract attention for using machine learning to help respond to COVID-19. Upon entering the hospital, patients are given an automatic face scan, which uses machine learning to detect whether or not they have a fever.

On its own, this data is probably not extremely helpful, but when dealing with hundreds or even thousands of patients, every piece of data is important in helping triage them effectively.

2.2 Using wearable technology to screen for resting heart rate

Apple made headlines when they used their Apple Watch to detect common heart issues with the help of machine learning. But patterns in resting heart rate can be indicative of more specific problems too, and some preliminary research using Fitbit data indicates that changes in resting heart rate can help identify “ILI” or “influenza-like illness” patients. Obviously, this is a long way from diagnosing COVID-19 specifically, but the research is still young.

Similarly, research from OURA, a sleep and activity tracking ring, uses body temperature, heart rate, and breathing rate to try to “identify patterns of onset, progression, and recovery for COVID-19.”

Both studies are still in progress, so no results are available yet.

2.3 Using chatbots for screening and diagnosis

If doctors spend too much time answering worried patients’ basic questions, they have less time to focus on treating patients who need them more. Many countries have therefore developed “self-triage” systems, where patients complete a questionnaire about their symptoms and medical history before being advised whether to stay home, call a doctor, or visit a hospital.

Many companies, including Microsoft, have released chatbots that help people self-identify their best course of action, given their specific symptoms.

Based on these examples, we can see that machine learning is currently better suited to screening COVID-19 patients than to reliably diagnosing them. Real diagnostics are challenging, partly because any diagnostic algorithm also has to be robust to mutations. Pardis Sabeti discusses some related challenges in a TED Talk:

“We also could see that, as the virus was moving between humans, it was mutating. And each of those mutations are so important, because the diagnostics, the vaccines, the therapies that we're using, are all based on that genome sequence – fundamentally, that's what drives it.”

If we do all of this work for a specific virus and then that virus mutates, a lot of work is potentially wasted and has to be redone. If we do find a “holy grail” machine learning algorithm that can quickly and accurately detect COVID-19, it will need to be robust enough to handle mutations.

3. Speeding up drug development

In response to a new pandemic, it’s critical to come up with a vaccine, a reliable diagnostic method, and a drug for treatment – fast. Current methods involve a lot of trial and error, which takes time. It can take months to isolate even one viable vaccine candidate.

Machine learning can speed up this process significantly without sacrificing quality control. When researchers were trying to find small molecule inhibitors of the Ebola virus, they discovered that training Bayesian ML models with viral pseudotype entry assay and the Ebola virus replication assay data helped speed up the scoring process. (Scoring involves assigning each molecule a value based on how likely it is to help.) This accelerated process quickly identified three potential molecules for testing.
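As a rough illustration of Bayesian scoring, the sketch below trains a Bernoulli naive Bayes model with Laplace smoothing and ranks candidate molecules by their log-odds of being active. This is an assumption about what such a scorer could look like, not the Ebola researchers' actual pipeline. The four-bit "fingerprints" (each bit marking whether a substructure is present), the labels, and the candidate names are all invented.

```python
import math


def train_naive_bayes(fingerprints, labels):
    """Estimate a class prior and per-bit probabilities for each class."""
    n_bits = len(fingerprints[0])
    model = {}
    for cls in (0, 1):
        rows = [fp for fp, label in zip(fingerprints, labels) if label == cls]
        # Laplace smoothing keeps probabilities away from exactly 0 or 1.
        bit_probs = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_bits)
        ]
        model[cls] = (len(rows) / len(labels), bit_probs)
    return model


def log_likelihood(model, cls, fingerprint):
    """Log of prior times likelihood of the fingerprint under one class."""
    prior, bit_probs = model[cls]
    total = math.log(prior)
    for bit, p in zip(fingerprint, bit_probs):
        total += math.log(p if bit else 1 - p)
    return total


def score(model, fingerprint):
    """Log-odds that a molecule is active; higher means more promising."""
    return log_likelihood(model, 1, fingerprint) - log_likelihood(model, 0, fingerprint)


# Invented training data: 1 = active in an assay, 0 = inactive.
train_fps = [
    [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1],
]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(train_fps, train_labels)

# Hypothetical untested candidates to prioritize.
candidates = {
    "candidate_A": [1, 1, 0, 0],
    "candidate_B": [0, 0, 1, 1],
    "candidate_C": [1, 0, 1, 0],
}
ranked = sorted(candidates, key=lambda name: score(model, candidates[name]), reverse=True)
print(ranked)
```

Only the highest-ranked molecules would then go on to lab testing, which is where the time saving comes from.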

Similarly, researchers working on H7N9 discovered that ML-assisted virtual screening and scoring led to substantial improvements in the accuracy of the scores. Using the random forest algorithm (a classification algorithm made up of lots of decision trees) provided the best results with H7N9.

In situations like the COVID-19 pandemic, where a virus is spreading rapidly, getting more accurate scores faster is critical to speeding up drug development.

4. Identifying effective existing drugs

Companies spend a lot of time and money getting new drugs approved. They need to be as sure as they possibly can that these drugs won’t have unexpected, harmful side effects.

This process protects us, but it also slows us down during a pandemic – just when we need a faster response.

One alternative is to repurpose drugs that have already been tested and used to treat other diseases.

But there are thousands of drug candidates, and we don’t have time to test them all – so how do we find the right one?

Machine learning can help us prioritize drug candidates much faster by automatically:

Building knowledge graphs and
Predicting interactions between drugs and viral proteins.

4.1 Building biomedical knowledge graphs

A lot of what we know about drugs, viruses, and their mechanisms is spread across a huge number of research articles. We can use natural language processing (machine learning applied to text) to read and interpret a large number of scientific articles and build biomedical knowledge graphs, which are structured networks that meaningfully connect different entities (such as drugs and proteins).

Specifically, scientists have customized an ML-built knowledge graph and applied it to COVID-19 to find a connection between the virus and the potential drug candidate Baricitinib.

COVID-19 most likely uses the protein ACE2 to enter our lung cells. This process – known as endocytosis – is regulated by AAK1 (another protein). Baricitinib inhibits AAK1, and could potentially also prevent COVID-19’s entry into our lung cells.

Figure: Baricitinib inhibiting AAK1, which regulates endocytosis of the coronavirus.
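The chain of reasoning above can be sketched as a tiny knowledge graph. In this hedged example, the triples are typed in by hand from the paragraph above rather than extracted by natural language processing, and a breadth-first search looks for a path linking the drug to the virus. The relation names and helper functions are illustrative assumptions.

```python
from collections import deque

# Hand-written (subject, relation, object) triples based on the text above.
TRIPLES = [
    ("Baricitinib", "inhibits", "AAK1"),
    ("AAK1", "regulates", "endocytosis"),
    ("endocytosis", "enables cell entry of", "coronavirus"),
    ("coronavirus", "binds", "ACE2"),
]


def build_graph(triples):
    """Turn triples into an adjacency list keyed by entity."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
        graph.setdefault(obj, [])
    return graph


def find_path(graph, start, goal):
    """Breadth-first search returning entities and relations along the shortest path."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [relation, neighbor]))
    return None  # no connection found


graph = build_graph(TRIPLES)
path = find_path(graph, "Baricitinib", "coronavirus")
print(" -> ".join(path))
```

In a real knowledge graph with millions of edges, paths like this one become hypotheses for researchers to check, not conclusions.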

4.2 Predicting drug-target interactions

Scientists are also using machine learning to identify drug candidates by predicting drug-target interactions (DTIs) between the virus’s proteins and existing drugs.

These interactions are highly complex, so researchers mostly choose neural networks to identify them (1, 2, 3). These networks are trained on large DTI databases to generate lists of particular drug candidates that are most likely to bind to and inhibit the virus’s proteins.

Notably, one research group has developed an end-to-end framework for using neural networks to process knowledge graphs, such as the one used to find Baricitinib. The model is then trained to interpret the knowledge graph and can be used to accurately predict DTIs.

Using this graph-topology learning model, researchers have already found a promising drug candidate, which is currently in clinical trials.

5. Predicting the spread of infectious disease using social networks

In the middle of a pandemic, when we’re trying to develop strategies to actively work against it, we first need to know where we are. We need to answer questions like “How many people are infected?” and “Where are these people?” Unfortunately pandemics – especially those caused by viruses – are difficult and expensive to keep track of.

Usually the government answers these questions, together with the health system. For example, every day (or week) the responsible agency counts and publicizes the number of new patients diagnosed with the disease. But one of the problems here is that there might be a big gap (in time and space) between contracting the disease, developing the first symptoms, and testing positive.

Luckily, we live in a digital world. A farmer who is starting to develop symptoms might live in a small town with no nearby hospitals capable of performing the test. But this same farmer might still be able to access social networks and immediately leave hints about his health and the spread of the disease – hints that only a machine learning model can learn to process at scale.

Figure: Social media tweets plotted on a map of the United States.

By interpreting the content of public interactions on social media, a machine learning model can assess how likely it is that people in an area have been infected with the novel virus. The model might not be able to classify people on an individual level, but it can use all of this data to estimate the spread of the pandemic in real time and to forecast the spread in the upcoming weeks.
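The step from individual posts to a regional estimate can be sketched as follows. The keyword matcher here is a deliberately crude stand-in for a trained text classifier, and the keyword list, region names, and posts are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical keyword list standing in for a trained text classifier.
SYMPTOM_TERMS = {"fever", "cough", "chills", "short of breath"}


def mentions_symptoms(text):
    """Return True if the post mentions any symptom keyword."""
    lowered = text.lower()
    return any(term in lowered for term in SYMPTOM_TERMS)


def symptom_share_by_region(posts):
    """Aggregate per-post flags into the share of flagged posts per region."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for region, text in posts:
        totals[region] += 1
        if mentions_symptoms(text):
            flagged[region] += 1
    return {region: flagged[region] / totals[region] for region in totals}


# Invented example posts as (region, text) pairs.
posts = [
    ("Region A", "Woke up with a fever and a dry cough"),
    ("Region A", "Great weather for a walk today"),
    ("Region B", "Traffic is terrible this morning"),
    ("Region B", "Trying a new pasta recipe"),
]
shares = symptom_share_by_region(posts)
print(shares)
```

Tracking how these shares change from day to day is what would feed a forecast. No single post is treated as a diagnosis.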

The value of this information in decision-making processes in the midst of a rapidly evolving pandemic cannot be overstated.

6. Understanding viruses through proteins

To understand a virus such as COVID-19 is to understand its proteins – whether and how we get sick depends entirely on how these proteins interact with our bodies. But interpreting them is no easy task.

The following use cases provide examples of how machine learning can help improve our understanding of viruses by analyzing their proteins.

6.1 Predicting viral-host protein-protein interactions

Protein-protein interactions (PPIs) between viruses and human body cells determine our body's reactions to pathogens. The virus-host interactome is the entire map of interactions between a virus’s and a host’s proteins. This interactome can be seen as a blueprint of how the virus infects our bodies and replicates in our cells.

Many research groups are working on reducing the vast range of possible interactions. Machine learning models trained with protein data have been successfully used to predict the most likely virus-host PPIs for HIV and H1N1 – greatly reducing the effort required to map the whole virus-host interactome.

Understanding how a virus interacts with our bodies is extremely important in the development of new treatments and the discovery of new drugs.

6.2 Predicting protein folding

Figure: Unfolded vs. folded protein.

We know that a protein’s structure is linked to its function – and once this structure is understood, we can guess its role in the cell, and scientists can develop drugs that work with the protein’s unique shape.

But defining a protein’s 3D structure is no easy task – the range of possible structures for a single protein is astronomical: assuming just three possible states per amino acid, a protein composed of 100 amino acids has 3^100 (about 5 × 10^47) possible conformations.

And there are over one billion known protein sequences, but we have only been able to identify the structures of less than 0.1% of them.
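The size of that search space is easy to check with a few lines of arithmetic. This sketch assumes three possible states per amino acid (a common simplification) and a purely hypothetical rate of 10^13 conformations tested per second.

```python
STATES_PER_RESIDUE = 3          # simplifying assumption
CHAIN_LENGTH = 100              # amino acids in the example protein
SAMPLES_PER_SECOND = 10**13     # hypothetical sampling rate
SECONDS_PER_YEAR = 365 * 24 * 3600

# Total number of conformations if every residue varies independently.
conformations = STATES_PER_RESIDUE ** CHAIN_LENGTH

# Time needed to try each conformation once at the assumed rate.
years_to_enumerate = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"{conformations:.2e} conformations")
print(f"{years_to_enumerate:.2e} years to try them all")
```

Even at that rate, brute-force enumeration would take vastly longer than the age of the universe, which is why learned models that predict structure directly are so valuable.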

Using artificial neural networks, research groups have successfully built models that can predict protein structures, finally making it feasible to identify protein structures using computational methods.

7. Figuring out how to attack the virus

Epitopes are clusters of amino acids found on the outside of a virus. Antibodies bind to epitopes, which is how our immune system recognizes and eliminates the virus. So finding and classifying epitopes is essential in determining which part of a molecule to target when we develop vaccines.

Compared to traditional vaccines, which contain inactivated pathogens, epitope-based vaccines are considered safer – they contain only fragments of the pathogen, which reduces the risk of potentially deadly side effects.

Locating the correct epitope can be a time-consuming, expensive process. With a new pandemic, such as COVID-19, locating epitopes faster speeds up the process of developing effective vaccines.

This is where machine learning can help. Support vector machines (SVMs), hidden Markov models, and artificial neural networks (specifically deep learning) have all proven to be faster and more accurate at identifying epitopes than human researchers are.

8. Identifying hosts in the natural world

A zoonotic pandemic – like the one we are experiencing with the novel coronavirus – is a pandemic caused by an infectious disease that originates in a different species (such as bats) and spreads to humans. Viruses such as Ebola, HIV, or COVID-19 can survive unnoticed in the natural world for a long time, waiting for the next mutation and the next opportunity to infect us. They hide in animals – called reservoir hosts – that are unaffected by the illness.

Knowing who these reservoir hosts are is vital in fighting a pandemic – once we’ve found them, we can develop strategies to control the spread of the disease and prevent more outbreaks from happening.

Figure: Viruses and the types of animals in which they reside.

The classical approach to finding reservoir hosts can take years of research, and there are still many orphan viruses that haven’t been matched to an animal host.

So what can we do?

Thanks to huge advances in technology, Whole-Genome Sequencing (WGS, the process of determining an organism’s complete DNA sequence) has become cheap and fast. Research has shown that machine learning models can use genome sequencing data together with expert knowledge to pinpoint the species that most likely acted as hosts for the disease.

By looking at a small subset of species, we can dramatically speed up the process of finding these pathogens in the wild.
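One plausible way to rank candidate hosts from sequence data is to compare short-subsequence (k-mer) frequencies. This is only a sketch: it is not the method used in the research mentioned above, and the host labels and toy sequences are invented. It uses a simple nearest-centroid comparison in place of a full machine learning model.

```python
import math
from collections import Counter
from itertools import product

K = 2  # length of the subsequences (k-mers) to count
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(sequence):
    """Return normalized k-mer frequencies for one genome sequence."""
    counts = Counter(sequence[i:i + K] for i in range(len(sequence) - K + 1))
    total = sum(counts[kmer] for kmer in KMERS) or 1
    return [counts[kmer] / total for kmer in KMERS]


def centroid(profiles):
    """Average several profiles into one representative profile."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_hosts(training, query):
    """Order host labels from most to least similar to the query sequence."""
    centroids = {
        host: centroid([kmer_profile(seq) for seq in sequences])
        for host, sequences in training.items()
    }
    query_profile = kmer_profile(query)
    return sorted(centroids, key=lambda host: distance(centroids[host], query_profile))


# Invented toy sequences for viruses with known hosts.
training = {
    "bat": ["ATATATATGCAT", "ATATTATAATAT"],
    "rodent": ["GCGCGGCCGCGC", "GGCCGCGCGGCG"],
}
ranking = rank_hosts(training, "ATATATTATA")
print(ranking)
```

The top of such a ranking is the small subset of species that field researchers would sample first.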

9. Predicting the risk of new pandemics

Accurately predicting whether a strain of influenza is going to make a zoonotic leap (jumping from one species to another) can help doctors and medical professionals anticipate potential pandemics and prepare accordingly.

As one example, Influenza A exists primarily in the avian population, but it has the potential to jump to human hosts. Researchers working on Influenza A isolated 67,940 protein sequences from a database. They filtered these sequences so that the dataset included only those influenza strains with complete sequences of 11 influenza proteins.

With machine learning the researchers were then able to identify potentially zoonotic strains of influenza with high levels of accuracy. More work needs to be done to establish prediction models for direct transmission, but knowing which strains of influenza are likely to make a leap is an important first step in preparing for the next pandemic.

Conclusion

Machine learning is an important tool in fighting the current pandemic. If we take this opportunity to collect data, pool our knowledge, and combine our skills, we can save many lives – both now and in the future.

If you urgently need support to develop a machine learning application in a medical setting (e.g. working towards FDA approval), let’s talk.
