Information-theoretic tools for mining database structure from large data sets

13 June 2004

proceedings article
Published by Association for Computing Machinery (ACM)

p. 731-742
https://doi.org/10.1145/1007568.1007650

Abstract

Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.

Keywords

This publication has 16 references indexed in Scilit:

LIMBO: Scalable Clustering of Categorical Data
Published by Springer Nature ,2004
A case for fractured mirrors
The VLDB Journal, 2003
Interactive deduplication using active learning
Published by Association for Computing Machinery (ACM) ,2002
Mining database structure; or, how to build a data quality browser
Published by Association for Computing Machinery (ACM) ,2002
Information dependencies
Published by Association for Computing Machinery (ACM) ,2000
Tane: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
The Computer Journal, 1999
Mining association rules between sets of items in large databases
Published by Association for Computing Machinery (ACM) ,1993
Vertical partitioning algorithms for database design
ACM Transactions on Database Systems, 1984
Minimum Covers in Relational Database Model
Journal of the ACM, 1980
The use of cluster analysis in physical data base design
Published by Association for Computing Machinery (ACM) ,1975