A common basis for similarity measures involving two strings†
- 1 January 1983
- journal article
- research article
- Published by Taylor & Francis in International Journal of Computer Mathematics
- Vol. 13 (1) , 17-40
- https://doi.org/10.1080/00207168308803349
Abstract
Many numerical indices which quantify the similarity and dissimilarity between a pair of stringsX and Y, have been defined in the literature. Some of these include the Length of their Longest Common Subsequence (LLCS(X, Y)), the Length of their Shortest Common Supersequence (LSCS(X, Y)), and their Generalized Levenshtein Distance (GLD(X, Y)). Some non-numerical indices relating the strings are the set of their common subsequences, the set of their common supersequences and the set of their shuffles. In this paper, we consider an abstract measure between X and Y, written as D(X, Y), defined in terms of two abstract operators ⊕ and ⊛, and a binary function d( [sdot], [sdot] ) whose arguments are symbols of an alphabet à Depending on the various concrete operators used for ⊕ and ⊛ and the specific function used for d( [sdot], [sdot] ), all the quantities discussed above can be seen to be particular cases of D(X, Y). We have presented an algorithm to recursively compute D(X, Y), which can serve to be a common scheme to compute all these quantities. Many new results are obtained using this abstract formulation, such as an explicit linear relationship between the LLCS and the LSCS between two strings. It can also be seen that the algorithm to compute the LLCS between X and Y is a special case of the algorithm to compute D(X, Y). In addition, by using different concrete values for ⊕, ⊛ and d([sdot],[sdot]), new one-pass algorithms can be developed to compute various quantities such as the set of Longest Common Subsequences (LCS), the set of all the Shortest Common Supersequences (SCS) without backtracking, and the set of all the shuffles of two strings.Keywords
This publication has 11 references indexed in Scilit:
- A faster algorithm computing string edit distancesJournal of Computer and System Sciences, 1980
- The Complexity of Some Problems on Subsequences and SupersequencesJournal of the ACM, 1978
- Algorithms for the Longest Common Subsequence ProblemJournal of the ACM, 1977
- A fast algorithm for computing longest common subsequencesCommunications of the ACM, 1977
- Bounds on the Complexity of the Longest Common Subsequence ProblemJournal of the ACM, 1976
- Bounds for the String Editing ProblemJournal of the ACM, 1976
- A linear space algorithm for computing maximal common subsequencesCommunications of the ACM, 1975
- An algorithm for the distance between two finite sequencesJournal of Combinatorial Theory, Series A, 1974
- The String-to-String Correction ProblemJournal of the ACM, 1974
- A general method applicable to the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology, 1970