Untangling compound documents on the web
- 26 August 2003
- proceedings article
- Published by Association for Computing Machinery (ACM)
Abstract
Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, along with the results of some large-scale studies of such documents on the web. The primary motivation for this work is that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.
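To make the notion of a compound document concrete, the sketch below groups individual web nodes into candidate compound documents by their shared parent directory in the URL path. This is a minimal illustrative heuristic under assumed example URLs (example.org), not the identification technique presented in the paper.

```python
from urllib.parse import urlparse
from collections import defaultdict

def group_into_compound_documents(urls):
    """Group web node URLs into candidate compound documents,
    using the shared parent directory as a simple illustrative key."""
    groups = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        # Treat everything up to the last path segment as the document key,
        # e.g. http://example.org/report/ch1.html -> http://example.org/report/
        parent = parsed.path.rsplit("/", 1)[0] + "/"
        key = f"{parsed.scheme}://{parsed.netloc}{parent}"
        groups[key].append(url)
    return dict(groups)

if __name__ == "__main__":
    pages = [
        "http://example.org/report/intro.html",
        "http://example.org/report/methods.html",
        "http://example.org/report/results.html",
        "http://example.org/blog/2003/08/post.html",
    ]
    for doc, nodes in group_into_compound_documents(pages).items():
        print(doc, "->", nodes)
```

Under this assumption, the three /report/ pages would be treated as one logical document for retrieval purposes, while the blog post remains its own document.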