Information-theoretic tools for mining database structure from large data sets
- 13 June 2004
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 731-742
- https://doi.org/10.1145/1007568.1007650
Abstract
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.Keywords
This publication has 16 references indexed in Scilit:
- LIMBO: Scalable Clustering of Categorical DataPublished by Springer Nature ,2004
- A case for fractured mirrorsThe VLDB Journal, 2003
- Interactive deduplication using active learningPublished by Association for Computing Machinery (ACM) ,2002
- Mining database structure; or, how to build a data quality browserPublished by Association for Computing Machinery (ACM) ,2002
- Information dependenciesPublished by Association for Computing Machinery (ACM) ,2000
- Tane: An Efficient Algorithm for Discovering Functional and Approximate DependenciesThe Computer Journal, 1999
- Mining association rules between sets of items in large databasesPublished by Association for Computing Machinery (ACM) ,1993
- Vertical partitioning algorithms for database designACM Transactions on Database Systems, 1984
- Minimum Covers in Relational Database ModelJournal of the ACM, 1980
- The use of cluster analysis in physical data base designPublished by Association for Computing Machinery (ACM) ,1975