Autonomous citation matching

Abstract
Advances in computational resources and the communications infrastructure, and the rapid rise of the World Wide Web, have led to the increasingly widespread availability of scientific papers in electronic form. Scientific papers usually contain citations to previous work, and indices of these citations are valuable for literature search, analysis, and evaluation. Current citation indices of the scientific literature are constructed using manual effort and are typically expensive. Part of the reason for using manual effort is the great variability of citation syntax: it can be difficult to autonomously determine whether two citations refer to the same article because citations can be written in many different formats. We present machine learning techniques that identify variant forms of citations to the same paper. A number of algorithms are presented. An algorithm based on word and phrase matching is found to perform best, and is sufficiently accurate for unassisted use in an autonomous citation indexing system. An algorithm based on a string edit distance performs poorly in comparison. A computationally efficient subfield algorithm is also presented. The accuracy and efficiency of all algorithms are quantitatively compared on a number of datasets.
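The contrast between edit-distance matching and word matching can be illustrated with a minimal sketch. This is not the paper's implementation: the normalization rule, the Jaccard word overlap, the function names, and the two example citations are all assumptions made up for illustration.

```python
import re


def normalize(citation: str) -> str:
    """Lowercase and turn punctuation into spaces (an illustrative choice)."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", citation.lower()).split())


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1] on the normalized strings."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def word_similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets (a stand-in for word matching)."""
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    union = wa | wb
    return 1.0 if not union else len(wa & wb) / len(union)


# Two hypothetical citations to the same (made-up) paper, in different formats.
CITE_A = "Smith, J. (1998). Learning to match citations. Journal of Examples."
CITE_B = "J. Smith. Learning to match citations. Journal of Examples, 1998."

if __name__ == "__main__":
    print("edit similarity:", edit_similarity(CITE_A, CITE_B))
    print("word similarity:", word_similarity(CITE_A, CITE_B))
```

In this sketch, reordering the author and year fields leaves the word sets identical, while the character-level edit distance is penalized by the moved fields. That is one plausible reason an edit-distance approach could lag a word-based one on citations written in different formats.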
