Duplicate Record Detection: A Survey
- 30 November 2006
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 19 (1), 1-16
- https://doi.org/10.1109/tkde.2007.250581
Abstract
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
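As a rough illustration of what a field-level similarity metric can look like, the sketch below uses Levenshtein edit distance, one widely used character-based metric. It is not taken from the paper: the function names, the averaging of per-field scores, and the 0.8 threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(rec1: dict, rec2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: mean similarity over shared fields vs. a threshold."""
    shared = rec1.keys() & rec2.keys()
    scores = [field_similarity(rec1[k], rec2[k]) for k in shared]
    return bool(scores) and sum(scores) / len(scores) >= threshold


r1 = {"name": "Jon Smith", "city": "Seattle"}
r2 = {"name": "John Smith", "city": "Seatle"}
print(likely_duplicates(r1, r2))
```

Comparing every pair of records this way costs quadratic time in the number of records, which is why efficiency and scalability techniques matter for approximate duplicate detection.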