Discovering informative content blocks from Web documents
- 23 July 2002
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 588-593
- https://doi.org/10.1145/775047.775134
Abstract
In this paper, we propose a new approach to discovering informative content from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions a page into several content blocks according to the HTML <TABLE> tags in the page. Based on the occurrence of features (terms) across the set of pages, it calculates an entropy value for each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method to dynamically select the entropy threshold that partitions blocks into informative and redundant. Informative content blocks are the distinguishing parts of a page, whereas redundant content blocks are the parts common across pages. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, with this approach, the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. Adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications will increase retrieval and extraction precision and reduce index size and extraction complexity.
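The entropy-based separation described in the abstract can be illustrated with a minimal Python sketch. It is not the InfoDiscoverer implementation: the function names are hypothetical, pages are assumed to be already split into blocks of terms, feature entropy is assumed to be Shannon entropy of a term's distribution over pages (normalized to [0, 1] by using the number of pages as the log base), block entropy is assumed to be the mean entropy of the block's distinct terms, and a fixed threshold stands in for the dynamic threshold selection the abstract mentions.

```python
import math


def feature_entropies(pages):
    """Entropy of each term's distribution over pages, normalized to [0, 1].

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms. Assumption: log base = number of pages, so a term spread
    evenly over all pages scores 1 and a term found on one page scores 0.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to compare")
    counts = {}  # term -> list of per-page occurrence counts
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[i] += 1
    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        h = 0.0
        for c in per_page:
            if c:
                p = c / total
                h -= p * math.log(p, n)
        entropies[term] = h
    return entropies


def block_entropy(block, entropies):
    """Assumption: block entropy is the mean entropy of its distinct terms.

    An empty block is scored 1.0 (treated as redundant) in this sketch.
    """
    terms = set(block)
    if not terms:
        return 1.0
    return sum(entropies[t] for t in terms) / len(terms)


def classify_blocks(pages, threshold):
    """Label every block: entropy above the threshold means 'redundant'.

    Assumption: the threshold is supplied by the caller; the dynamic
    selection method from the paper is not reproduced here.
    """
    entropies = feature_entropies(pages)
    result = []
    for page in pages:
        labeled = []
        for block in page:
            h = block_entropy(block, entropies)
            labeled.append((h, "redundant" if h > threshold else "informative"))
        result.append(labeled)
    return result
```

Under these assumptions, a navigation block that repeats the same terms on every page scores close to 1 and is labeled redundant, while an article block whose terms appear on only one page scores 0 and is labeled informative.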