Text joins for data cleansing and integration in an RDBMS

13 May 2004

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

pp. 729-731
https://doi.org/10.1109/icde.2003.1260850

Abstract

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
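
As a rough illustration of the matching the abstract describes, the sketch below computes an exact cosine-similarity text join in memory. It is not the paper's sampling-based strategy and does not run inside an RDBMS. The character 3-gram tokenisation, the tf-idf weighting formula, the 0.5 threshold, and the function names (tokens, tfidf_vectors, text_join) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokens(text, q=3):
    """Split a string into lowercase character q-grams (assumed tokenisation)."""
    s = text.lower()
    if len(s) < q:
        return [s]
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length tf-idf weight dict per input string."""
    bags = [Counter(tokens(s, q)) for s in strings]
    n = len(bags)
    # Document frequency: number of strings in which each token occurs.
    df = Counter(t for bag in bags for t in bag)
    vectors = []
    for bag in bags:
        # Assumed weighting: tf * log(1 + n / df), always positive.
        w = {t: tf * math.log(1 + n / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()})
    return vectors


def text_join(left, right, threshold=0.5, q=3):
    """Exact text join: all (left, right, similarity) pairs at or above threshold."""
    vecs = tfidf_vectors(left + right, q)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    out = []
    for i, a in enumerate(lv):
        for j, b in enumerate(rv):
            # Vectors are unit length, so the dot product is the cosine similarity.
            sim = sum(w * b.get(t, 0.0) for t, w in a.items())
            if sim >= threshold:
                out.append((left[i], right[j], sim))
    return out
```

The nested loop compares every pair of strings, which hints at why the abstract calls an exact answer expensive and motivates an approximate alternative.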
