Computational solutions to large-scale data management and analysis

1 September 2010

journal article
review article
Published by Springer Nature in Nature Reviews Genetics

Vol. 11 (9), 647-657
https://doi.org/10.1038/nrg2857

Abstract

Biological research is becoming ever more information-driven, with individual laboratories now capable of generating terabytes of data in days. Supercomputing resources will increasingly be needed to get the most from the big data sets that researchers generate or analyse.

The big data revolution in biology is matched by a revolution in high-performance computing that is making supercomputing resources available to anyone with an internet connection.

Large-scale data analysis poses a number of challenges, including data transfer (bringing the data and computational resources together), controlling access to the data, managing the data, standardizing data formats and integrating data of multiple different types to accurately model biological systems.

New computational solutions that are readily available to all can help to address these challenges. These solutions include cloud-based computing and high-speed, low-cost heterogeneous computational environments. Taking advantage of these resources requires a thorough understanding of the data and the computational problem.

Knowing how the analysis algorithms can be parallelized enables a more efficient solution to a computational problem by distributing tasks over many computer processors. The types of parallelism can be classified into two broad categories: loosely coupled (or coarse-grained) parallelism and tightly coupled (or fine-grained) parallelism, each benefiting from different types of computational platforms, depending on the problem of interest.

Clusters of computers can be optimized for many different classes of computationally intense applications, such as sequence alignment, genome-wide association tests and reconstruction of Bayesian networks. Cloud computing makes cluster-based computing more accessible and affordable for all.
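
As a rough illustration of the loosely coupled (coarse-grained) parallelism described above, the following minimal Python sketch distributes independent per-read tasks over a pool of workers. It is not taken from the article: the function names `gc_fraction` and `map_independent`, the GC-content task and the example reads are all hypothetical choices made for illustration. A thread pool is used so that the sketch runs anywhere; for CPU-bound work on a real data set, a process pool or a cluster scheduler would more likely be used.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(read: str) -> float:
    """Return the fraction of G/C bases in one read.

    The result depends on this read only, so every call is an
    independent task (loosely coupled parallelism).
    """
    if not read:
        return 0.0
    gc = sum(1 for base in read.upper() if base in "GC")
    return gc / len(read)


def map_independent(func, items, workers=4, executor_cls=ThreadPoolExecutor):
    """Apply func to every item using a pool of workers.

    executor_cls is an illustrative parameter: passing
    concurrent.futures.ProcessPoolExecutor instead would spread
    CPU-bound tasks over several processor cores.
    Results are returned in the same order as the input items.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT", "GATTACA"]  # hypothetical reads
    for read, fraction in zip(reads, map_independent(gc_fraction, reads)):
        print(read, round(fraction, 3))
```

Because no task needs the result of any other task, the work can be split across as many workers as are available with no communication between them. Tightly coupled problems, by contrast, require frequent exchange of intermediate results and do not decompose this simply.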
The distributed computing paradigm MapReduce has been designed for cloud-based computing to solve problems such as mapping raw DNA sequence reads to a reference genome (that is, problems that have loosely coupled parallelism).

Cloud computing provides a highly flexible, low-cost computational environment. However, the costs of cloud computing include sacrificing control of the underlying hardware and requiring that big data sets be transferred into the cloud for processing.

Heterogeneous multi-core computational systems, such as graphics processing units (GPUs), are complementary to cloud-based computing and operate as low-cost, specialized accelerators that can increase peak arithmetic throughput by 10-fold to 100-fold. These systems are specifically tuned to efficiently solve problems involving massive tightly coupled parallelism.

Heterogeneous computing provides a low-cost, flexible computational environment that improves performance and efficiency by exposing architectural features to programmers. However, programming applications to run in these environments requires significant informatics expertise.

Cloud providers such as Microsoft make advanced cloud computing resources freely available to individual researchers through a competitive, peer-reviewed granting process. Other providers, such as Amazon, provide advanced cloud storage and computational resources through a simple, intuitive web interface. Today, users of Amazon Web Services can not only upload big data sets and analysis tools to Amazon S3 but also solve problems using MapReduce through a point-and-click interface.
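
The MapReduce approach to read mapping mentioned above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration of the map, shuffle and reduce phases, not the article's method and not the algorithm of any particular tool. The seed length `K`, the function names and the example sequences are assumptions made for illustration, and the reduce step only reports candidate positions where a read's first k-mer matches the reference exactly, without verifying the rest of the read.

```python
from collections import defaultdict

K = 4  # seed (k-mer) length; an arbitrary value chosen for this sketch


def map_reference(ref):
    """Map phase for the reference: emit (k-mer, ("ref", position))."""
    for pos in range(len(ref) - K + 1):
        yield ref[pos:pos + K], ("ref", pos)


def map_read(read_id, read):
    """Map phase for a read: emit its first k-mer as a seed."""
    if len(read) >= K:
        yield read[:K], ("read", read_id)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their k-mer key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values):
    """Reduce phase for one k-mer: pair each read with each reference hit."""
    positions = [v for tag, v in values if tag == "ref"]
    read_ids = [v for tag, v in values if tag == "read"]
    for read_id in read_ids:
        for pos in positions:
            yield read_id, pos


def candidate_hits(ref, reads):
    """Run map, shuffle and reduce in memory; return sorted (read, pos) pairs."""
    pairs = list(map_reference(ref))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read))
    hits = []
    for values in shuffle(pairs).values():
        hits.extend(reduce_seed(values))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTT"                              # hypothetical reference
    reads = {"r1": "ACGTAC", "r2": "GTTT", "r3": "CCCC"}  # hypothetical reads
    print(candidate_hits(reference, reads))
```

Each k-mer group is reduced independently of every other group, which is what makes the problem loosely coupled: in a real MapReduce framework the groups would be processed on different machines, whereas here they are simply processed in a loop.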
