A common basis for similarity measures involving two strings†

1 January 1983

journal article
research article
Published by Taylor & Francis in International Journal of Computer Mathematics

Vol. 13 (1) , 17-40
https://doi.org/10.1080/00207168308803349

Abstract

Many numerical indices which quantify the similarity and dissimilarity between a pair of stringsX and Y, have been defined in the literature. Some of these include the Length of their Longest Common Subsequence (LLCS(X, Y)), the Length of their Shortest Common Supersequence (LSCS(X, Y)), and their Generalized Levenshtein Distance (GLD(X, Y)). Some non-numerical indices relating the strings are the set of their common subsequences, the set of their common supersequences and the set of their shuffles. In this paper, we consider an abstract measure between X and Y, written as D(X, Y), defined in terms of two abstract operators ⊕ and ⊛, and a binary function d( [sdot], [sdot] ) whose arguments are symbols of an alphabet Ã Depending on the various concrete operators used for ⊕ and ⊛ and the specific function used for d( [sdot], [sdot] ), all the quantities discussed above can be seen to be particular cases of D(X, Y). We have presented an algorithm to recursively compute D(X, Y), which can serve to be a common scheme to compute all these quantities. Many new results are obtained using this abstract formulation, such as an explicit linear relationship between the LLCS and the LSCS between two strings. It can also be seen that the algorithm to compute the LLCS between X and Y is a special case of the algorithm to compute D(X, Y). In addition, by using different concrete values for ⊕, ⊛ and d([sdot],[sdot]), new one-pass algorithms can be developed to compute various quantities such as the set of Longest Common Subsequences (LCS), the set of all the Shortest Common Supersequences (SCS) without backtracking, and the set of all the shuffles of two strings.

Keywords

This publication has 11 references indexed in Scilit:

A faster algorithm computing string edit distances
Journal of Computer and System Sciences, 1980
The Complexity of Some Problems on Subsequences and Supersequences
Journal of the ACM, 1978
Algorithms for the Longest Common Subsequence Problem
Journal of the ACM, 1977
A fast algorithm for computing longest common subsequences
Communications of the ACM, 1977
Bounds on the Complexity of the Longest Common Subsequence Problem
Journal of the ACM, 1976
Bounds for the String Editing Problem
Journal of the ACM, 1976
A linear space algorithm for computing maximal common subsequences
Communications of the ACM, 1975
An algorithm for the distance between two finite sequences
Journal of Combinatorial Theory, Series A, 1974
The String-to-String Correction Problem
Journal of the ACM, 1974
A general method applicable to the search for similarities in the amino acid sequence of two proteins
Journal of Molecular Biology, 1970