Identification of duplicate and near‐duplicate full‐text records in database search‐outputs using hierarchic cluster analysis

1 March 1995

journal article
Published by Emerald Publishing in Program: electronic library and information systems

Vol. 29 (3), 241-256
https://doi.org/10.1108/eb047198

Abstract

Clustering the output of a multi‐database online search enables a user to obtain an overview of the information that has been retrieved without the need to inspect any documents that contain only redundant information. In this paper we describe a classification scheme that characterises the degree of relationship between pairs of documents in database search‐outputs and then report the application of a range of clustering methods and similarity coefficients to 20 such outputs. These experiments demonstrate that clustering is capable of grouping documents that are identical to, or closely related to, other documents in the search‐output on the basis of their term similarities.
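
As a rough illustration of the idea in the abstract, grouping documents by term similarity with a hierarchic (agglomerative) method might be sketched as follows. The abstract does not say which clustering methods or similarity coefficients the paper tested, so the Dice coefficient, single-link merging, the similarity threshold, and the function names `terms`, `dice`, and `single_link_clusters` are all assumptions made for this sketch, not the authors' procedure.

```python
import re
from itertools import combinations


def terms(text):
    """Return the set of lower-cased word tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(a, b):
    """Dice coefficient between two term sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def single_link_clusters(docs, threshold):
    """Agglomerative single-link clustering, stopped at a similarity threshold.

    The two most similar clusters are merged repeatedly, where the similarity
    of two clusters is the highest Dice value between any pair of their
    documents. Merging stops once the best similarity falls below `threshold`.
    Returns the final clusters (lists of document indices) and the merge
    history as (similarity, left_cluster, right_cluster) tuples.
    """
    term_sets = [terms(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            sim = max(
                dice(term_sets[i], term_sets[j])
                for i in clusters[x]
                for j in clusters[y]
            )
            if best is None or sim > best[0]:
                best = (sim, x, y)
        sim, x, y = best
        if sim < threshold:
            break
        merges.append((sim, sorted(clusters[x]), sorted(clusters[y])))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return [sorted(c) for c in clusters], merges
```

Calling `single_link_clusters(records, 0.7)` on a list of record texts returns the groups of likely duplicates together with the order in which they were merged. The threshold of 0.7 is an arbitrary illustrative value; a suitable cut-off would have to be chosen empirically for a given set of search-outputs.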
