The amount of biomedical data being generated nationally is exploding and holds great promise for research.
The data is often organized in the form of networks, which provide insights into interactions among the components of biological systems, such as molecules, genes and cells, as well as associations between these components, their function, diseases and drugs.
But the lack of efficient and effective ways to store, access and query this Big Data slows discovery and applications to improve human health, according to the National Institutes of Health (NIH).
Case Western Reserve University researcher Mehmet Koyuturk, an associate professor of electrical engineering and computer science, was awarded a $1.3 million grant from NIH to help develop open-source software that stores this mountain range of information and makes it readily accessible.
His was one of 15 proposals to win grants under the NIH Big Data to Knowledge (BD2K) initiative. Satya Sahoo, an assistant professor in the Division of Medical Informatics in the Case Western Reserve School of Medicine, also received a grant.
Starting in June, Koyuturk will lead a team that includes Purdue University computer science professors Ananth Grama, who specializes in high performance computing, and Wojciech Szpankowski, who specializes in information theory; and Shankar Subramaniam, chair of the bioengineering department at the University of California, San Diego. The three researchers were Koyuturk’s mentors while he was a PhD student.
The four will investigate optimal storage schemes and algorithms that compress data yet still enable searching and querying of complex and often highly connected network data, and that allow users to create their own versions of networks. The algorithms will be built into broadly usable and user-friendly software.
“We were doing this kind of work 10 years ago when I was a PhD student,” said Koyuturk, who studies bioinformatics and algorithms. “We created an algorithm that summarizes networks and applied it and saw it was effective.”
“But no one was talking about Big Data then; the data was being generated rapidly but the community was not ready for it. We were even being questioned why we needed that level of efficiency,” he continued. “Big Data has been recognized during the last five years, and there is a great opportunity now. We’ll take up where we left off.”
The researchers will work with data gathered by the national institutes from studies of the human genome, allergies and disease, neurology, mental health and more. These include networks of basic biological data, molecular interactions and associations, such as genes with disease and drugs with proteins.
“There are lots of network databases, but it’s not easy to query them,” Koyuturk said. “You must now download them to do analysis or query… It can become so complicated since almost every problem that involves network comparisons is intractable.”
“When you put all different types of interactions together, the networks become quite large,” Koyuturk said. But that’s the goal: to develop a unified systems view of the cellular machinery of living organisms and be able to access multiple versions of network data quickly and in meaningful ways.
The team will provide and maintain the software, updating it according to feedback from users. The software will include what’s called “version control,” which lets researchers reproduce earlier work on earlier versions of network data, even as that data rapidly accumulates and changes.
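To give a rough sense of what version control for network data can mean, here is a minimal sketch in Python. It is not the team's software or method: the class name VersionedNetwork, its commit and checkout methods, and the node labels geneA through geneD are all hypothetical, chosen only for illustration. The sketch assumes each version is stored as the set of edges added and removed since the previous version, so that any earlier version can be rebuilt on request.

```python
from dataclasses import dataclass, field


def edge(a: str, b: str) -> frozenset:
    """An undirected edge between two node labels."""
    return frozenset((a, b))


@dataclass
class VersionedNetwork:
    """Hypothetical store that keeps each version as a delta from the previous one."""

    # One (added_edges, removed_edges) pair per committed version.
    deltas: list = field(default_factory=list)

    def commit(self, added=(), removed=()) -> int:
        """Record a new version and return its 1-based version number."""
        self.deltas.append((frozenset(added), frozenset(removed)))
        return len(self.deltas)

    def checkout(self, version: int) -> set:
        """Rebuild the edge set as it stood at the given version (0 = empty network)."""
        if not 0 <= version <= len(self.deltas):
            raise ValueError(f"unknown version: {version}")
        edges = set()
        for added, removed in self.deltas[:version]:
            edges -= removed  # drop edges retracted in this version
            edges |= added    # then apply edges introduced in this version
        return edges


# Illustrative data only: placeholder node labels, not real interactions.
net = VersionedNetwork()
v1 = net.commit(added=[edge("geneA", "geneB"), edge("geneC", "geneD")])
v2 = net.commit(added=[edge("geneA", "geneC")], removed=[edge("geneC", "geneD")])

old_edges = net.checkout(v1)  # the network as an earlier analysis saw it
new_edges = net.checkout(v2)  # the network after the update
```

Storing only the changes keeps each new version small when a large network is revised a little at a time, at the cost of replaying the changes whenever an old version is requested.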
The technology will specifically be for biomedical scientists, but the engineers believe it will have broader applications.