Abstract
Three measures of sequence dissimilarity were compared on a computer-generated model system in which substitutions in random sequences were made at randomly selected sites, with the replacement character chosen at random from the set of characters different from the original occupant of the site. The three measures were the conventional mismatch count between aligned sequences (AMC=m) and two measures not requiring prior sequence alignment. The latter two were the squared Euclidean distance between vectors of counts of t-tuples (t=1–6) of characters in the two sequences (multiplet distribution distances or MDD=d) and counts of characters not covered by word structures of statistically significant length common to the two sequences (common long words or CLW=SIB, SIS, or SAB). Average MDD distances were found to be two times average mismatch counts in the simulated sequences for all values of t from 1 to 6 and for all degrees of substitution, from one per sequence to so many as to produce, effectively, random sequences. This simple relation held independently of sequence length and of sequence composition. The relation was confirmed by exact results on small model systems and by formal asymptotic results in two limits: that of so few substitutions that no double hits occur, and that of two random sequences. The coefficient of variation for MDD distances was greater than that for mismatch counts for singlets, but both measures approached the same low value for sextets. Needleman-Wunsch alignment produced incorrect mismatch counts at higher degrees of substitution. The model satisfied the conditions for the derivation of the Jukes-Cantor asymptotic adjustment, but its application produced increasingly poor results with increasing degrees of substitution, in accord with earlier results on model and natural sequences.
This deterioration was a consequence of the sensitivity of the adjustment to error in the observations, which increases with the degree of substitution. Average CLW distances for a variety of common word structures were more or less parallel to MDD distances for appropriately long t-tuples. These results on model systems supported the validity, found in earlier work on natural sequences (Blaisdell 1989), of the two dissimilarity measures not requiring sequence alignment.
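As an illustration of the model system and of two of the measures described above, the following minimal sketch generates a random sequence, applies random substitutions, and computes the mismatch count and the squared Euclidean distance between t-tuple count vectors. It is not the original simulation code. The function names, the four-letter alphabet, the sequence length, and the choice to sample sites with replacement are assumptions made for illustration. Any normalization applied to the MDD for t > 1 is not stated in the abstract and is not reproduced here; the sketch checks only the singlet case, where one substitution lowers one character count by one and raises another by one, giving a squared distance of 2 for a mismatch count of 1.

```python
import random
from collections import Counter


def substitute(seq, n_subs, alphabet, rng):
    """Apply n_subs substitutions at randomly chosen sites.

    Assumption: sites are sampled with replacement (multiple hits allowed)
    and each replacement differs from the character currently at the site.
    """
    s = list(seq)
    for _ in range(n_subs):
        i = rng.randrange(len(s))
        s[i] = rng.choice([c for c in alphabet if c != s[i]])
    return "".join(s)


def mismatch_count(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tuple_counts(seq, t):
    """Counts of all overlapping t-tuples in seq."""
    return Counter(seq[i:i + t] for i in range(len(seq) - t + 1))


def mdd(a, b, t):
    """Unnormalized squared Euclidean distance between t-tuple count vectors."""
    ca, cb = tuple_counts(a, t), tuple_counts(b, t)
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed, for repeatability only
    alphabet = "ACGT"               # hypothetical four-letter alphabet
    length, n_subs, trials = 200, 20, 500
    total_m = total_d = 0
    for _ in range(trials):
        parent = "".join(rng.choice(alphabet) for _ in range(length))
        child = substitute(parent, n_subs, alphabet, rng)
        total_m += mismatch_count(parent, child)
        total_d += mdd(parent, child, 1)
    print("mean mismatch count:", total_m / trials)
    print("mean singlet MDD:   ", total_d / trials)
```

Run as a script, the sketch prints the two averages for singlets so that their ratio can be compared with the factor of two reported above. The values depend on the seed and the parameters chosen.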