Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences
Open Access
- 6 September 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (22) , 4125-4132
- https://doi.org/10.1093/bioinformatics/bti658
Abstract
Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. First, we compare the performance of several word-based or alignment-based methods. Second, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Third, we use a large-scale simulation method to simulate data from the distribution of SK–LD (symmetric Kullback–Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity β between any pair of DNA sequences.

Results: Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably not >3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SK–LD performance is superior in both simulation and real data analysis; (4) the estimate of β based on SK–LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) the estimate is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.

Availability: The algorithm SK–LD, estimate and simulation software are implemented in MATLAB code, and are available at

Contact: tjwu@stat.ncku.edu.tw

Supplementary information: Tables A1–A3, and Remarks 1–11 at
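To make the idea of a word-based dissimilarity measure concrete, the following is a minimal Python sketch of a symmetric Kullback–Leibler discrepancy between the k-word frequency distributions of two DNA sequences. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the abstract does not give the exact SK–LD formula, so the form Σ (p − q)·log(p/q), the pseudocount used to avoid zero frequencies, and the function names `word_frequencies` and `symmetric_kl` are all assumptions made here.

```python
from collections import Counter
from itertools import product
from math import log


def word_frequencies(seq, k, pseudocount=1.0):
    """Relative frequencies of all 4**k overlapping k-words in seq.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency positive so the logarithm below is always defined.
    Words containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_kl(seq_a, seq_b, k, pseudocount=1.0):
    """Assumed symmetric KL form: sum over words of (p - q) * log(p / q)."""
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)
```

With this form, comparing a sequence with itself, for example `symmetric_kl("ACGTACGTAC", "ACGTACGTAC", k=2)`, gives 0.0, and every term of the sum is non-negative, so larger values indicate more dissimilar word compositions. In a windowed setting, the same function would be applied to each window of the chosen size, with `k` set to the word size appropriate for that window size.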