Abstract
Identifying statistically significant words in DNA and protein sequences forms the basis for many genetic studies. By applying the maximal entropy principle, we give a systematic way to study the nonrandom occurrence of words in DNA or protein sequences. Comparison with experimental results shows that patterns of regulatory binding sites in the Saccharomyces cerevisiae (yeast) genome tend to occur with statistical significance in the promoter regions. We studied two correlated gene families of yeast. The method successfully extracts the experimentally verified binding sites in each family. Many putative regulatory sites in the upstream regions are proposed. The study also suggests that some regulatory sites are active in both directions, while others show directional preference.
