Using Big Data to predict which viruses can lead to a pandemic

By Petra Wiesmayer

The rapid spread of the coronavirus is currently dominating the news. Since the first cases became known in late December 2019 in Wuhan, a Chinese city of several million people, new cases have been reported around the world almost every hour. Although research into an effective cure is in full swing, a solution is still a long way off. It is clear that being able to quickly identify this kind of harmful virus is of critical importance. Thanks to Big Data techniques, German scientists have succeeded in predicting which pathogens can lead to the development of highly contagious strains.

After the SARS pandemic in 2002/2003, and with the constant emergence of new influenza viruses (which cost more than 25,000 lives in Germany alone during the winter of 2017/2018), the coronavirus outbreak illustrates the importance of being able to swiftly identify specific characteristics of new viral and bacterial strains.

The new international project “Pangaia” (Pan-genome Graph Algorithms and Data Integration), in which Bielefeld University in Germany is participating, can play a role in this. Using Big Data technology, scientists compare the genome of a single organism with the genomes of all strains of a species. They examine how the huge volumes of data involved can be arranged and analyzed in such a way that they can be utilized in biomedicine.

Comparisons with reference genomes

Whether a genome exhibits certain variations is determined by comparing it with a reference genome. A reference genome is built by combining different genomes in such a way that it exhibits the typical characteristics of a whole species. For influenza viruses, this means that the virus is compared to a reference genome that combines all the typical characteristics of previously known virus strains.
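To illustrate the idea, here is a minimal Python sketch of a reference built from several strains and used to spot variations in a new sample. It assumes the sequences are already aligned and of equal length, and it uses a simple majority vote per position. The function names and the toy sequences are invented for illustration and are not taken from the Pangaia project.

```python
from collections import Counter


def consensus(aligned_strains):
    """Return the most common letter at each position of pre-aligned sequences."""
    if len({len(seq) for seq in aligned_strains}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_strains)
    )


def variations(sample, reference):
    """List (position, reference letter, sample letter) wherever the two differ."""
    return [
        (i, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, sample))
        if ref != obs
    ]


# Toy data (hypothetical): three known strains and one new sample.
strains = ["ACGTAC", "ACGTTC", "ACGAAC"]
reference = consensus(strains)
new_sample_variations = variations("ACCTAC", reference)
```

Real reference genomes are assembled with far more elaborate methods; the sketch only shows the principle of comparing one genome against a single combined reference.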

Special graphs depict connections between genomes in terms of nodes. When individual genes from a single organism differ significantly from the typical genes of its species, they show up as conspicuous curves and lines. For example, this is the case for genes that cause hereditary diseases. Photo: University of Bielefeld/R. Wittler

“In these cases, we only compare two genomes with each other. Differences and similarities are relatively easy to recognize on the computer,” says Professor Dr. Jens Stoye from the Technical Faculty of Bielefeld University. He is involved in Pangaia together with his Genome Informatics group. “This new approach can increase the number of comparative genomes up to a thousand times.” Researchers call this study of a population’s genetic repertoire ‘pan-genomics.’ “The problem with computer-assisted pan-genomics so far has been the amount of confusion caused by the mass of data,” explains Professor Dr. Alexander Schönhuth, who is coordinating Pangaia’s Bielefeld subproject.

Nucleotides, which are the building blocks of genetic material, are represented by the letters A, C, G and T. Genomes sometimes consist of billions of these information units, and they are traditionally displayed side by side as ‘letter chains’ to make them easier to compare. “But with hundreds of comparative genomes, it takes a lot of time to analyze step by step how the genome under investigation differs from each of the comparative genomes,” Schönhuth says.
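The step-by-step comparison Schönhuth describes can be sketched in a few lines of Python. This is a simplified illustration that assumes all letter chains are already aligned and equally long; the names and sequences are made up. It also counts the letter comparisons performed, which grow with the number of comparative genomes multiplied by the genome length.

```python
def differing_positions(a, b):
    """Positions at which two equally long letter chains differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def compare_one_by_one(sample, comparison_genomes):
    """Compare the sample with every comparative genome in turn."""
    differences = {}
    letter_comparisons = 0
    for name, sequence in comparison_genomes.items():
        differences[name] = differing_positions(sample, sequence)
        letter_comparisons += min(len(sample), len(sequence))
    return differences, letter_comparisons


# Toy data (hypothetical): real genomes can be billions of letters long.
comparison_genomes = {"strain_a": "ACGTAC", "strain_b": "ACGTTC"}
differences, letter_comparisons = compare_one_by_one("ACGTAC", comparison_genomes)
```

With two genomes of six letters the cost is trivial, but with hundreds of genomes of billions of letters each, repeating the comparison genome by genome becomes the bottleneck the researchers want to avoid.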

Simultaneous comparison of numerous strains of the same organism

The new technology now makes it possible to analyze numerous strains of the same organism at the same time, whether they are viruses, bacteria or even higher organisms, says Jens Stoye. This way, the similarities and differences between the individual members can be highlighted. In the case of pathogens, it is often even possible to understand and predict the processes that lead to the development of highly contagious strains.

In order to make computer-assisted pan-genomics both faster and more application-friendly, the researchers intend to develop new algorithms and data structures over the next few years, for instance algorithms for variation graphs. Using these variation graphs, computers look for similarities and differences between the comparative genomes. The results are then displayed graphically.
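A toy version of a variation graph can be written in Python as follows. This is a deliberately simplified sketch, not the project's algorithms: it assumes pre-aligned sequences of equal length, treats every (position, letter) pair as a node, and records which strains pass through each node. All function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_variation_graph(aligned_strains):
    """Merge aligned strains into one graph of (position, letter) nodes."""
    nodes = defaultdict(set)  # node -> names of the strains that contain it
    edges = defaultdict(set)  # node -> nodes that directly follow it
    for name, sequence in aligned_strains.items():
        previous = None
        for position, letter in enumerate(sequence):
            node = (position, letter)
            nodes[node].add(name)
            if previous is not None:
                edges[previous].add(node)
            previous = node
    return nodes, edges


def variable_positions(nodes):
    """Positions where the strains branch into more than one letter."""
    letters = defaultdict(set)
    for position, letter in nodes:
        letters[position].add(letter)
    return {p: sorted(found) for p, found in letters.items() if len(found) > 1}


def novel_variants(sample, nodes):
    """Letters in the sample that no known strain has at that position."""
    return [(p, letter) for p, letter in enumerate(sample) if (p, letter) not in nodes]


# Toy data (hypothetical strains).
strains = {"strain_a": "ACGTAC", "strain_b": "ACGTTC", "strain_c": "ACGAAC"}
nodes, edges = build_variation_graph(strains)
branching = variable_positions(nodes)
new_mutations = novel_variants("ACCTAC", nodes)
```

Because shared stretches are stored only once, a sample is checked against all strains in a single pass over the graph rather than once per genome. Production variation graphs also handle insertions, deletions and rearrangements, which this sketch leaves out.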

“Variation graphs enable rapid and high-resolution differentiation of pathogenic and innocuous variants of a virus,” says Schönhuth. “In particular, they also allow the identification of entirely new mutations, like those likely to have occurred in a variant of the coronavirus that is currently breaking out in China, which has triggered resistance to the usual drugs as well.”

Detection of hereditary diseases in humans

According to the scientists, this new method can also be used for the detection of hereditary diseases in humans. On top of that, it can also help to determine which mutations in a tumor have caused severe, pathological growth.

The project runs from January 2020 to December 2023. It is funded by the European Union with €1.14 million from its Horizon 2020 research framework program. Bielefeld University is one of eleven project partners from Europe and North America. The University of Milan (Italy) is coordinating the project. Other partners are the Dutch science organization NWO, the Comenius University of Bratislava (Slovakia), the biotech companies Geneton (Slovakia) and Illumina Cambridge (Great Britain), the Pasteur Institute (France), Simon Fraser University (Canada), the University of Tokyo (Japan), and Cornell University and Pennsylvania State University (both USA).
