Computational solutions to large-scale data management and analysis
- 1 September 2010
- journal article
- review article
- Published by Springer Nature in Nature Reviews Genetics
- Vol. 11 (9), 647-657
- https://doi.org/10.1038/nrg2857
Abstract
Biological research is becoming ever more information-driven, with individual laboratories now capable of generating terabyte scales of data in days. Supercomputing resources will be increasingly needed to get the most from the big data sets that researchers generate or analyse. The big data revolution in biology is matched by a revolution in high-performance computing that is making supercomputing resources available to anyone with an internet connection. A number of challenges are posed by large-scale data analysis, including data transfer (bringing the data and computational resources together), controlling access to the data, managing the data, standardizing data formats and integrating data of multiple different types to accurately model biological systems. New computational solutions that are readily available to all can aid in addressing these challenges. These solutions include cloud-based computing and high-speed, low-cost heterogeneous computational environments. Taking advantage of these resources requires a thorough understanding of the data and the computational problem. Knowing the parallelization of the analysis algorithms enables a more efficient solution to a computational problem by distributing tasks over many computer processors. The types of parallelism can be classified into two broad categories: loosely coupled (or coarse-grained) parallelism and tightly coupled (or fine-grained) parallelism, each benefiting from different types of computational platforms, depending on the problem of interest. Clusters of computers can be optimized for many different classes of computationally intense applications, such as sequence alignment, genome-wide association tests and reconstruction of Bayesian networks. Cloud computing makes cluster-based computing more accessible and affordable for all. 
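As a rough illustration of loosely coupled (coarse-grained) parallelism, the sketch below computes a per-sequence statistic where no task depends on any other, so the work can be spread over processes with no communication between them. The function names, the GC-fraction statistic and the toy reads are assumptions chosen for illustration and are not taken from the article.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_fraction(sequence):
    """Fraction of G and C bases in one sequence.

    The result depends only on this sequence, which is what makes the
    overall job loosely coupled.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_fractions_parallel(sequences, workers=2):
    """Distribute independent per-sequence tasks over worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order; tasks never exchange data.
        return list(pool.map(gc_fraction, sequences))


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT"]  # hypothetical toy reads
    print(gc_fractions_parallel(reads))
```

A tightly coupled problem would not split this cleanly, because each step would need intermediate results from other tasks; that is the case the abstract associates with specialized accelerators such as GPUs.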
The distributed computing paradigm MapReduce has been designed for cloud-based computing to solve problems such as mapping raw DNA sequence reads to a reference genome (that is, problems that have loosely coupled parallelism). Cloud computing provides a highly flexible, low-cost computational environment. However, the costs of cloud computing include sacrificing control of the underlying hardware and requiring that big data sets be transferred into the cloud for processing. Heterogeneous multi-core computational systems, such as graphics processing units (GPUs), are complementary to cloud-based computing and operate as low-cost, specialized accelerators that can increase peak arithmetic throughput by 10-fold to 100-fold. These systems are specifically tuned to efficiently solve problems involving massive tightly coupled parallelism. Heterogeneous computing provides a low-cost, flexible computational environment that improves performance and efficiency by exposing architectural features to programmers. However, programming applications to run in these environments requires significant informatics expertise. Cloud providers such as Microsoft make advanced cloud computing resources freely available to individual researchers through a competitive, peer-reviewed granting process. Other providers, such as Amazon, provide advanced cloud storage and computational resources via an intuitive and simple web interface. Users of Amazon Web Services can today not only upload big data sets and analysis tools to Amazon S3 but also solve problems using MapReduce via a point-and-click interface.
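The MapReduce pattern described above can be sketched in a single process as follows. This is a toy model only: it uses exact string matching rather than a real aligner, it runs no distributed framework, and the reference string, function names and read set are all hypothetical. It is meant to show the shape of the map, shuffle and reduce phases, not the method used by any published tool.

```python
from collections import defaultdict
from itertools import chain

REFERENCE = "ACGTACGTTAGCCGATAGCTA"  # hypothetical toy reference sequence


def map_read(read, reference=REFERENCE):
    """Map phase: emit (position, read) for every exact occurrence of a read.

    Each read is handled independently, so reads could be mapped on
    different machines without coordination.
    """
    pairs = []
    start = reference.find(read)
    while start != -1:
        pairs.append((start, read))
        start = reference.find(read, start + 1)
    return pairs


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_position(position, reads):
    """Reduce phase: count the reads that start at one reference position."""
    return position, len(reads)


def run_job(reads, reference=REFERENCE):
    """Run map, shuffle and reduce in sequence and return {position: count}."""
    emitted = chain.from_iterable(map_read(r, reference) for r in reads)
    grouped = shuffle(emitted)
    return dict(reduce_position(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["ACGT", "ACGT", "TAGC", "GGGG"]))
```

In a real deployment the map and reduce functions would be handed to a framework that schedules them across many nodes; only the per-read and per-key logic would be written by the user.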