Using web structure for classifying and describing web pages
- 7 May 2002
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 562-569
- https://doi.org/10.1145/511446.511520
Abstract
The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
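The abstract mentions ranking words and phrases by expected entropy loss but does not give the formula. The following is a minimal sketch of one common formulation (equivalent to information gain over a binary class), and it may differ in detail from the authors' method. The function names and the representation of each document as a set of features are assumptions made for illustration.

```python
import math


def entropy(p):
    """Binary entropy, in bits, of a class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Prior class entropy minus the expected entropy after observing
    whether a feature is present (assumed formulation, not the paper's text)."""
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f

    prior = entropy(pos_total / total)
    p_f = with_f / total

    # Class entropy among documents that do / do not contain the feature.
    post_with = entropy(pos_with / with_f) if with_f else 0.0
    post_without = (
        entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    )
    return prior - (p_f * post_with + (1 - p_f) * post_without)


def rank_features(pos_docs, neg_docs):
    """Rank features by expected entropy loss, highest first.

    pos_docs and neg_docs are lists of sets; each set holds the words or
    phrases taken from one document (or from the text citing it).
    """
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for feature in vocab:
        pos_with = sum(feature in doc for doc in pos_docs)
        neg_with = sum(feature in doc for doc in neg_docs)
        scores[feature] = expected_entropy_loss(
            pos_with, len(pos_docs), neg_with, len(neg_docs)
        )
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Under this sketch, a feature that appears in every positive example and in no negative example removes all uncertainty about the class and ranks first, which is the kind of term that would serve as a cluster name.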
This publication has 7 references indexed in Scilit:
- A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, 2002
- Efficient identification of Web communities. Published by Association for Computing Machinery (ACM), 2000
- Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999
- Digital libraries and autonomous citation indexing. Computer, 1999
- Combining labeled and unlabeled data with co-training. Published by Association for Computing Machinery (ACM), 1998
- Enhanced hypertext categorization using hyperlinks. ACM SIGMOD Record, 1998
- Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 1998