A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences
- 1 June 2006
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 13 (5), 1028-1040
- https://doi.org/10.1089/cmb.2006.13.1028
Abstract
The DUST module has been used within BLAST for many years to mask low-complexity sequences. In this paper, we present a new implementation of the DUST module that uses the same function to assign a complexity score to a sequence, but uses a different rule by which high-scoring sequences are masked. The new rule masks every nucleotide masked by the old rule and occasionally masks more. The new masking rule corrects two related deficiencies with the old rule. First, the new rule is symmetric with respect to reversing the sequence. Second, the new rule is not context sensitive; the decision to mask a subsequence does not depend on what sequences flank it. The new implementation is at least four times faster than the old on the human genome. We show that both the percentage of additional bases masked and the effect on MegaBLAST outputs are very small.
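The abstract does not state the scoring function itself. As a minimal sketch, assume the triplet-based form in which DUST is commonly described: count each overlapping 3-mer in the sequence, sum c(c-1)/2 over the counts c, and divide by the number of triplets minus one. The function name `dust_score` is hypothetical and this is not the authors' implementation. The sketch only illustrates why a score of this shape is unchanged when the sequence is reversed; it does not implement either masking rule.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Triplet-based complexity score (assumed form, not taken from the abstract)."""
    seq = seq.upper()
    # All overlapping 3-mers of the sequence.
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        # Too short to contain a repeated triplet.
        return 0.0
    counts = Counter(triplets)
    # Each triplet seen c times contributes c * (c - 1) / 2 repeated pairs.
    repeated_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return repeated_pairs / (len(triplets) - 1)


if __name__ == "__main__":
    low_complexity = "A" * 10
    mixed = "ACGTACGT"
    print(dust_score(low_complexity))  # homopolymer: high score
    print(dust_score(mixed))           # more varied sequence: lower score
    # Reversing a sequence reverses every triplet but keeps the count
    # multiset, so the score is the same in both directions.
    print(dust_score(mixed) == dust_score(mixed[::-1]))
```

Under this assumed score, a homopolymer run scores much higher than a mixed sequence, and a sequence and its reversal always score the same. Any asymmetry in masking therefore has to come from the rule that selects which high-scoring subsequences to mask, which is the part the paper changes.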