TAILOR: a record linkage toolbox
Open Access
- 25 June 2003
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10636382, p. 17-28
- https://doi.org/10.1109/icde.2002.994694
Abstract
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.