WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences
Open Access
- 1 January 1992
- journal article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 20 (11) , 2871-2875
- https://doi.org/10.1093/nar/20.11.2871
Abstract
We present here a fast and sensitive method designed to isolate short nucleotide sequences which have non-random statistical properties and may thus be biologically active. It is based on a first-order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition, we have found other statistically significant motifs, the biological roles of which are yet to be clarified.
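The idea of a first-order Markov analysis of short words can be illustrated with a minimal sketch. This is not the WORDUP implementation: the abstract does not give the exact statistic, so the code below assumes the common first-order estimator, in which the expected count of a word is the product of its overlapping dinucleotide counts divided by the product of the counts of its interior nucleotides. The function names (`count_words`, `expected_count`, `score_words`) and the simple z-like score are hypothetical choices made for illustration only.

```python
from collections import Counter
from math import sqrt


def count_words(seqs, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def expected_count(word, mono, di):
    """Expected count of `word` under a first-order Markov model (assumed estimator).

    E(w1..wk) = prod N(w_i w_{i+1}) / prod N(w_j) over interior positions j.
    """
    num = 1.0
    for i in range(len(word) - 1):
        num *= di[word[i:i + 2]]
    den = 1.0
    for c in word[1:-1]:
        den *= mono[c]
    return num / den if den else 0.0


def score_words(seqs, k=6):
    """Return {word: (observed, expected, score)} for every observed k-mer.

    The score (obs - exp) / sqrt(exp) is an illustrative choice, not the
    significance test used in the paper.
    """
    mono = count_words(seqs, 1)
    di = count_words(seqs, 2)
    obs = count_words(seqs, k)
    result = {}
    for w, o in obs.items():
        e = expected_count(w, mono, di)
        result[w] = (o, e, (o - e) / sqrt(e))
    return result
```

Words with a large positive score would be candidates for over-represented (shared) motifs, and words with a large negative score for avoided ones. A real analysis would also need a proper significance test and correction for the number of words examined.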
This publication has 31 references indexed in Scilit:
- Genome inhomogeneity is determined mainly by WW and SS dinucleotides. Bioinformatics, 1991
- Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Bioinformatics, 1990
- A General Rule for Ranged Series of Codon Frequencies in Different Genomes. Journal of Biomolecular Structure and Dynamics, 1989
- Recognition of characteristic patterns in sets of functionally equivalent DNA sequences. Bioinformatics, 1987
- Cell-type specific protein binding to the enhancer of simian virus 40 in nuclear extracts. Nature, 1986
- Linguistics of Nucleotide Sequences: Morphology and Comparison of Vocabularies. Journal of Biomolecular Structure and Dynamics, 1986
- Rigorous pattern-recognition methods for DNA sequences. Journal of Molecular Biology, 1985
- ACNUC – a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Bioinformatics, 1985
- Eukaryotic dinucleotide preference rules and their implications for degenerate codon usage. Journal of Molecular Biology, 1981
- Organization and Expression of Eucaryotic Split Genes Coding for Proteins. Annual Review of Biochemistry, 1981