Optimization of de novo transcriptome assembly from next-generation sequencing data
Open Access
- 6 August 2010
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 20 (10) , 1432-1440
- https://doi.org/10.1101/gr.103846.109
Abstract
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
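The claim that a single k-mer length is suboptimal under heterogeneous coverage can be illustrated with a toy sketch. This is not the authors' Multiple-k pipeline: the function names, the example reads, and the "shared k-mer" linking rule are assumptions made here for illustration, standing in for the de Bruijn graph connectivity a real assembler would compute. It shows only one side of the trade-off, namely that sparsely covered transcripts, whose reads overlap by just a few bases, stay connected at a short k and fragment at a longer one.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def linked_components(reads, k):
    """Count groups of reads joined by sharing at least one k-mer.

    A simplified proxy (assumption) for contig connectivity: two reads
    are considered linked if their k-mer sets intersect.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kmer_sets = [kmers(r, k) for r in reads]
    for i, j in combinations(range(len(reads)), 2):
        if kmer_sets[i] & kmer_sets[j]:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(reads))})


# Hypothetical low-coverage transcript: three 10-base reads that
# overlap their neighbours by only 4 bases.
low_coverage_reads = ["ACGTTGCAAG", "CAAGCTTAGG", "TAGGCTCCAT"]

for k in (4, 5):
    n = linked_components(low_coverage_reads, k)
    print(f"k={k}: {n} connected group(s)")
```

With these made-up reads, the three fragments form one group at k=4 and three separate groups at k=5, because a 4-base overlap cannot contain a shared 5-mer. Highly covered transcripts generally tolerate, and benefit from, longer k-mers, which is the motivation the abstract gives for combining several k values rather than picking one.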