Untangling compound documents on the web
- 26 August 2003
- proceedings article
- Published by Association for Computing Machinery (ACM)
Abstract
Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, along with the results of some large-scale studies of such documents on the web. The primary motivation for this work is that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.
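To make the notion of a compound document concrete, the sketch below groups individual web nodes into candidate compound documents by their shared parent directory in the URL path. This is a minimal illustrative heuristic under assumed example URLs (example.org), not the identification technique presented in the paper.

```python
from urllib.parse import urlparse
from collections import defaultdict

def group_into_compound_documents(urls):
    """Group web node URLs into candidate compound documents,
    using the shared parent directory as a simple illustrative key."""
    groups = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        # Treat everything up to the last path segment as the document key,
        # e.g. http://example.org/report/ch1.html -> http://example.org/report/
        parent = parsed.path.rsplit("/", 1)[0] + "/"
        key = f"{parsed.scheme}://{parsed.netloc}{parent}"
        groups[key].append(url)
    return dict(groups)

if __name__ == "__main__":
    pages = [
        "http://example.org/report/intro.html",
        "http://example.org/report/methods.html",
        "http://example.org/report/results.html",
        "http://example.org/blog/2003/08/post.html",
    ]
    for doc, nodes in group_into_compound_documents(pages).items():
        print(doc, "->", nodes)
```

Under this assumption, the three /report/ pages would be treated as one logical document for retrieval purposes, while the blog post remains its own document.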