September 13, 2016 – Computing experts at the Department of Energy’s Oak Ridge National Laboratory collaborated with a team of university researchers and software companies to develop a novel hybrid computational strategy to efficiently discover genetic variants on an unprecedented scale.
The team, led by Baylor College of Medicine’s Human Genome Sequencing Center, worked with the Cohorts for Heart and Aging Research in Genomic Epidemiology to analyze more than 5,000 whole-genome-sequenced samples accompanied by decades of health information related to common chronic diseases. The effort included researchers from DNAnexus, Rice University and the Human Genetics Center at the University of Texas Health Science Center.
Large-scale genomic analysis has been limited by a lack of infrastructure for handling massive datasets. Traditional computing infrastructure scaled poorly, leading to impractically long run times, low-quality results and computational bottlenecks.
The new approach solves these problems by connecting several powerful computing resources to deliver high-quality, sensitive analysis of thousands of genetic samples in a timely manner.
“This is an excellent example of two scientific communities coming together to address challenging science problems. We are happy to have played a part in conducting the analysis of such unprecedented scale,” said Manjunath Gorentla Venkata, co-author and ORNL computer scientist. “While researchers from Baylor discussed the problem, we did not have a ready-made solution. After multiple discussions, we were convinced that mapping pipeline components based on system architecture strengths and tailoring parameters to the architecture would provide quality analysis with a relatively short turnaround.”
The hybrid pipeline processed more than 5,300 whole genome samples in six weeks and can be scaled to analyze more than 10,000 samples with the same high sensitivity and specificity. The entire operation used 5.2 million core hours and transferred six terabytes of data across all the platforms.
The team used the Rhea computing cluster at the Oak Ridge Leadership Computing Facility to reconstruct chromosomal segments inherited from parents and to statistically predict the makeup of incomplete or missing genetic sequences from discovered genetic markers. This step was the most computationally intensive, requiring the greatest amount of computing power to calculate the probabilities of the most likely genetic patterns. More than 75 percent of this step was finished on Rhea, and the rest was completed on supercomputers at Rice University.
Baylor utilized the Amazon Web Services cloud computing environment to store raw data and discover genetic variants across the thousands of genome samples.
“It has been a tremendous experience to work with a Cloud provider such as DNAnexus, as well as Oak Ridge National Laboratory and Rice University, applying their cloud platforms and supercomputer facilities to address various computational challenges,” said Fuli Yu, assistant professor at Baylor College of Medicine. “The excitement in this research is not only the scale, but also the interdisciplinary nature in the various levels of this operation.”
The details of the study can be found in the team’s paper, “A hybrid computational strategy to address WGS variant analysis in >5000 samples,” published in BMC Bioinformatics. Zhuoyi Huang and Navin Rustagi from the Human Genome Sequencing Center at Baylor College of Medicine led the efforts for this paper in collaboration with Narayanan Veeraraghavan and Richard Gibbs from Baylor College of Medicine, Andrew Carroll from DNAnexus, Eric Boerwinkle from the University of Texas Health Science Center and Manjunath Gorentla Venkata from Oak Ridge National Laboratory. Fuli Yu from Baylor College of Medicine was the senior author of the study.
Further coverage of the study can be found in a press release from Baylor College of Medicine.
The Oak Ridge Leadership Computing Facility is a DOE Office of Science User Facility.