Evaluating strategies for similarity search on the web

7 May 2002

proceedings article
Published by Association for Computing Machinery (ACM)

pp. 432-442
https://doi.org/10.1145/511446.511502

Abstract

Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
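
As a rough illustration of the idea of using a Web hierarchy in place of user feedback, the sketch below scores a similarity strategy by how often its ordering of pages agrees with the ordering implied by their positions in a directory tree. This is not taken from the paper: the tree distance, the Jaccard measure over term sets, the agreement score, and the toy `pages` data are all assumptions made for illustration.

```python
from itertools import combinations


def path_distance(path_a, path_b):
    """Tree edges between two category paths such as 'Arts/Music/Jazz'.

    Assumption: closeness in the hierarchy is measured by path length
    through the deepest shared ancestor category.
    """
    a, b = path_a.split("/"), path_b.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def jaccard(set_a, set_b):
    """Jaccard similarity of two feature sets (an illustrative strategy)."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def agreement(pages, similarity):
    """Fraction of comparisons where strategy and hierarchy agree.

    pages maps a page name to (category_path, feature_set). For each query
    page and each pair of other pages, the strategy agrees with the
    hierarchy if the page that is closer in the tree is also the more
    similar one. Ties on either side are skipped.
    """
    agree = total = 0
    for query, (q_path, q_feats) in pages.items():
        others = [name for name in pages if name != query]
        for p, q in combinations(others, 2):
            dist_p = path_distance(q_path, pages[p][0])
            dist_q = path_distance(q_path, pages[q][0])
            sim_p = similarity(q_feats, pages[p][1])
            sim_q = similarity(q_feats, pages[q][1])
            if dist_p == dist_q or sim_p == sim_q:
                continue
            total += 1
            if (dist_p < dist_q) == (sim_p > sim_q):
                agree += 1
    return agree / total if total else 0.0


# Hypothetical toy data: page name -> (directory category, term set).
pages = {
    "a": ("Arts/Music/Jazz", {"jazz", "sax", "music"}),
    "b": ("Arts/Music/Jazz", {"jazz", "trumpet", "music"}),
    "c": ("Arts/Music/Rock", {"rock", "guitar", "music"}),
    "d": ("Science/Physics", {"quantum", "physics"}),
}

if __name__ == "__main__":
    print(agreement(pages, jaccard))
```

Because the score needs only the directory assignments, swapping in a different `similarity` function (for example one built on anchor-text or link features) lets many strategies or parameter settings be compared without a user study.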
