A common basis for similarity measures involving two strings†

Abstract
Many numerical indices which quantify the similarity and dissimilarity between a pair of stringsX and Y, have been defined in the literature. Some of these include the Length of their Longest Common Subsequence (LLCS(X, Y)), the Length of their Shortest Common Supersequence (LSCS(X, Y)), and their Generalized Levenshtein Distance (GLD(X, Y)). Some non-numerical indices relating the strings are the set of their common subsequences, the set of their common supersequences and the set of their shuffles. In this paper, we consider an abstract measure between X and Y, written as D(X, Y), defined in terms of two abstract operators ⊕ and ⊛, and a binary function d( [sdot], [sdot] ) whose arguments are symbols of an alphabet à Depending on the various concrete operators used for ⊕ and ⊛ and the specific function used for d( [sdot], [sdot] ), all the quantities discussed above can be seen to be particular cases of D(X, Y). We have presented an algorithm to recursively compute D(X, Y), which can serve to be a common scheme to compute all these quantities. Many new results are obtained using this abstract formulation, such as an explicit linear relationship between the LLCS and the LSCS between two strings. It can also be seen that the algorithm to compute the LLCS between X and Y is a special case of the algorithm to compute D(X, Y). In addition, by using different concrete values for ⊕, ⊛ and d([sdot],[sdot]), new one-pass algorithms can be developed to compute various quantities such as the set of Longest Common Subsequences (LCS), the set of all the Shortest Common Supersequences (SCS) without backtracking, and the set of all the shuffles of two strings.