Estimating alphanumeric selectivity in the presence of wildcards

1 June 1996

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record

Vol. 25 (2) , 282-293
https://doi.org/10.1145/235968.233341

Abstract

Success of commercial query optimizers and database management systems (object-oriented or relational) depend on accurate cost estimation of various query reordering [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL like predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating alphanumeric selectivity.In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate.We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.

Keywords

This publication has 16 references indexed in Scilit:

Query Size Estimation by Adaptive Sampling
Journal of Computer and System Sciences, 1995
Combinatorial pattern discovery for scientific data
Published by Association for Computing Machinery (ACM) ,1994
Practical prefetching via data compression
Published by Association for Computing Machinery (ACM) ,1993
A linear-time probabilistic counting algorithm for database applications
ACM Transactions on Database Systems, 1990
Data compression with finite windows
Communications of the ACM, 1989
Approximating the number of unique values of an attribute without sorting
Information Systems, 1987
Access path selection in a relational database management system
Published by Association for Computing Machinery (ACM) ,1979
Compression of individual sequences via variable-rate coding
IEEE Transactions on Information Theory, 1978
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM, 1976
PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric
Journal of the ACM, 1968