Figaro: a novel statistical method for vector sequence removal
Open Access
- 17 January 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 24 (4) , 462-467
- https://doi.org/10.1093/bioinformatics/btm632
Abstract
Motivation: Sequences produced by automated Sanger sequencing machines frequently contain fragments of the cloning vector on their ends. Software tools currently available for identifying and removing the vector sequence require knowledge of the vector sequence, specific splice sites and any adapter sequences used in the experiment—information often omitted from public databases. Furthermore, the clipping coordinates themselves are missing or incorrectly reported. As an example, within the ∼1.24 billion shotgun sequences deposited in the NCBI Trace Archive, as many as ∼735 million (∼60%) lack vector clipping information. Correct clipping information is essential to scientists attempting to validate, improve and even finish the increasingly large number of genomes released at a ‘draft’ quality level.
Results: We present here Figaro, a novel software tool for identifying and removing the vector from raw sequence data without prior knowledge of the vector sequence. The vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics. We show that Figaro achieves 99.98% sensitivity when tested on ∼1.5 million shotgun reads from Drosophila pseudoobscura. We further explore the impact of accurate vector trimming on the quality of whole-genome assemblies by re-assembling two bacterial genomes from shotgun sequences deposited in the Trace Archive. Designed as a module in large computational pipelines, Figaro is fast, lightweight and flexible.
Availability: Figaro is released under an open-source license through the AMOS package (http://amos.sourceforge.net/Figaro).
Contact: mpop@umiacs.umd.edu
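The abstract states only that vector sequence is inferred from the frequency of short oligonucleotides using Poisson statistics; it does not give the procedure. The Python sketch below is one possible reading of that idea, not Figaro's actual implementation. It assumes that k-mers occurring at read starts far more often than their rate in the rest of the reads would predict are likely vector. The function names, the k-mer length, the prefix window, the significance threshold `alpha`, the add-one pseudocount and the synthetic reads are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def overrep_pvalue(obs, lam):
    """Upper-tail Poisson probability P(X >= obs) for X ~ Poisson(lam).

    Shortcut (assumption): returns 1.0 whenever obs <= lam, because such
    k-mers are not over-represented and the exact value is not needed here.
    """
    if obs <= lam:
        return 1.0
    if lam <= 0:
        return 0.0
    # First term of the tail, computed in log space to avoid overflow.
    term = math.exp(-lam + obs * math.log(lam) - math.lgamma(obs + 1))
    total, i = 0.0, obs
    while term > 0.0:
        total += term
        i += 1
        term *= lam / i          # next Poisson term; shrinks since i > lam
        if term < total * 1e-17:
            break
    return min(total, 1.0)


def infer_vector_kmers(reads, k=6, window=20, alpha=1e-6):
    """Flag k-mers that are over-represented in read prefixes."""
    prefix, background = Counter(), Counter()
    for read in reads:
        prefix.update(kmers(read[:window], k))
        background.update(kmers(read[window:], k))
    n_prefix = sum(prefix.values())
    n_background = sum(background.values())
    flagged = set()
    for kmer, obs in prefix.items():
        # Add-one pseudocount so unseen background k-mers get a nonzero rate.
        rate = (background[kmer] + 1) / (n_background + 1)
        if overrep_pvalue(obs, rate * n_prefix) < alpha:
            flagged.add(kmer)
    return flagged


def clip_point(read, flagged, k=6, window=20):
    """Return the end of the last flagged k-mer inside the prefix window."""
    clip = 0
    for i, kmer in enumerate(kmers(read[:window], k)):
        if kmer in flagged:
            clip = i + k
    return clip


# Synthetic data (hypothetical): a made-up 12-base "vector" fragment
# followed by a random 60-base insert, repeated for 200 reads.
random.seed(0)
VECTOR = "GATCCTCTAGAG"
reads = [VECTOR + "".join(random.choice("ACGT") for _ in range(60))
         for _ in range(200)]

flagged = infer_vector_kmers(reads)
trimmed = [r[clip_point(r, flagged):] for r in reads]
```

In this sketch, k-mers that straddle the vector/insert junction can also pass the threshold, so the clip point may fall a few bases past the true boundary. A production tool would need a more careful boundary rule, and would also have to handle reverse-strand reads and sequencing errors, none of which are modeled here.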