Discovering informative content blocks from Web documents

23 July 2002

proceedings article
Published by Association for Computing Machinery (ACM)

p. 588-593
https://doi.org/10.1145/775047.775134

Abstract

In this paper, we propose a new approach to discovering informative content in a set of tabular documents (Web pages) from a Web site. Our system, InfoDiscoverer, first partitions each page into several content blocks according to the HTML tags in the page. Based on the occurrence of features (terms) across the set of pages, it calculates the entropy value of each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method to dynamically select the entropy threshold that classifies blocks as either informative or redundant. Informative content blocks are the distinguishing parts of a page, whereas redundant content blocks are the parts common to many pages. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, using this approach, the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. By adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications, retrieval and extraction precision will increase, and index size and extraction complexity will be reduced.
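
The pipeline described in the abstract (per-feature entropy over the page set, then a per-block entropy, then a threshold) can be illustrated with a minimal sketch. This is not the InfoDiscoverer implementation. It assumes that pages are already split into blocks of terms, that feature entropy is Shannon entropy of a term's distribution over pages normalized by the log of the page count, and that block entropy is the mean of its feature entropies. The function names and the fixed `threshold` parameter are hypothetical; the dynamic threshold selection mentioned in the abstract is not reproduced here.

```python
import math
from collections import defaultdict


def feature_entropies(pages):
    """Normalized entropy in [0, 1] of each term's distribution over pages.

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms. A term spread evenly over all pages scores near 1;
    a term confined to a single page scores 0.
    """
    n = len(pages)
    counts = defaultdict(lambda: [0] * n)
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts[term][i] += 1

    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        h = 0.0
        for c in per_page:
            if c:
                p = c / total
                h -= p * math.log(p)
        # Assumption: normalize by log(n) so values are comparable across sites.
        entropies[term] = h / math.log(n) if n > 1 else 0.0
    return entropies


def block_entropy(block, entropies):
    """Assumption: a block's entropy is the mean entropy of its terms."""
    if not block:
        return 1.0  # assumption: treat an empty block as redundant
    return sum(entropies[t] for t in block) / len(block)


def classify(pages, threshold=0.5):
    """Return, per page, a list of booleans: True means 'informative'.

    The fixed threshold is a hypothetical stand-in for the dynamically
    selected entropy threshold described in the abstract.
    """
    ent = feature_entropies(pages)
    return [
        [block_entropy(block, ent) <= threshold for block in page]
        for page in pages
    ]
```

Under these assumptions, a navigation block whose terms recur on every page gets an entropy near 1 and is marked redundant, while an article block whose terms appear on only one page gets an entropy of 0 and is marked informative.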
