Heuristic informational analysis of sequences

Abstract
Nucleotide or amino-acid sequences are interpreted as successions of words of length k (k-tuples), whose frequencies vary widely between different statistical populations of genes or proteins. After k-tuple reference tables are built from coherent subsets or entire data banks, the local information content profile of an individual sequence is drawn. Anomalous regions (peaks or depressions) of such a profile can lead to the discovery and identification of specific sequence patterns. On the same principle, the simultaneous use of two reference statistical populations and the computation of an index combining the two information profiles lead to a general and powerful method of discriminant analysis. The identification of a “signal” associated with gene conversion, intron/exon discrimination, and the location of function-specific patterns in proteins are given as examples of successful applications of this heuristic informational approach.
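
As an illustration only, the three steps named above (reference table, information profile, two-population index) might be sketched as follows. The abstract does not give the formulas, so everything specific here is an assumption: information content is taken as -log2 of a pseudocount-smoothed relative k-tuple frequency, the profile is smoothed by a simple sliding mean, and the index is the difference between the two profiles. The function names ktuple_table, information_profile and discriminant_index, and the parameters window and pseudocount, are hypothetical.

```python
from collections import Counter
from math import log2


def ktuple_table(sequences, k):
    """Count overlapping k-tuples over a reference population of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def information_profile(seq, table, k, alphabet_size=4, window=1, pseudocount=1.0):
    """Local information content along seq, relative to a reference table.

    Assumed definition (not taken from the source): each k-tuple scores
    -log2 of its smoothed relative frequency in the table, and the scores
    are averaged over a sliding window of `window` consecutive positions.
    """
    total = sum(table.values()) + pseudocount * alphabet_size ** k
    raw = [
        -log2((table.get(seq[i:i + k], 0) + pseudocount) / total)
        for i in range(len(seq) - k + 1)
    ]
    if window <= 1:
        return raw
    return [sum(raw[i:i + window]) / window for i in range(len(raw) - window + 1)]


def discriminant_index(seq, table_a, table_b, k, **options):
    """Assumed index: profile under B minus profile under A.

    Positive values mark positions that look more typical of population A.
    """
    profile_a = information_profile(seq, table_a, k, **options)
    profile_b = information_profile(seq, table_b, k, **options)
    return [b - a for a, b in zip(profile_a, profile_b)]


if __name__ == "__main__":
    # Toy reference populations, invented for this example.
    table_a = ktuple_table(["ACACACAC"], k=2)
    table_b = ktuple_table(["GTGTGTGT"], k=2)
    print(discriminant_index("ACACGTGT", table_a, table_b, k=2))
```

In this sketch, peaks in the output of information_profile correspond to k-tuples that are rare in the reference population and depressions to common ones. For protein sequences, alphabet_size would be set to 20.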