Text joins for data cleansing and integration in an RDBMS
- 13 May 2004
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
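To make the notion of a cosine-similarity text join concrete, the following is a minimal sketch of an exact join computed in memory. It is not the paper's method: the abstract does not specify the tokenization or weighting scheme, so the word-level tokens, the `log(1 + n/df)` weighting, and the names `tokenize`, `tfidf_vectors`, and `text_join` are all assumptions made for illustration. The sampling-based approximation and the in-RDBMS execution that the paper proposes are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric word tokens; the paper may tokenize differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(strings):
    # Build unit-length tf-idf weight vectors, one dict per input string.
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings containing each token.
    df = Counter(tok for d in docs for tok in d)
    vectors = []
    for d in docs:
        # Assumed weighting: term frequency times log(1 + n / df).
        weights = {tok: tf * math.log(1 + n / df[tok]) for tok, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return vectors


def text_join(left, right, threshold):
    # Exact join: compare every pair and keep those at or above the threshold.
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    matches = []
    for i, a in enumerate(left_vecs):
        for j, b in enumerate(right_vecs):
            # Vectors are unit length, so the dot product is the cosine similarity.
            sim = sum(w * b.get(tok, 0.0) for tok, w in a.items())
            if sim >= threshold:
                matches.append((left[i], right[j], sim))
    return matches
```

The nested loop compares every pair of strings, which illustrates why an exact text join can be expensive on large relations and why an approximate strategy is attractive.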
This publication has 6 references indexed in Scilit:
- Interactive deduplication using active learning. Published by Association for Computing Machinery (ACM), 2002
- Eliminating Fuzzy Duplicates in Data Warehouses. Published by Elsevier, 2002
- A guided tour to approximate string matching. ACM Computing Surveys, 2001
- Approximating Matrix Multiplication for Pattern Recognition Tasks. Journal of Algorithms, 1999
- Integration of heterogeneous databases without common domains using queries based on textual similarity. Published by Association for Computing Machinery (ACM), 1998
- Integrating structured data and text: A relational approach. Journal of the American Society for Information Science, 1997