Entropy-based link analysis for mining web informative structures
- 4 November 2002
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 574-581
- https://doi.org/10.1145/584792.584886
Abstract
In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also referred to as TOC, i.e., table of contents, pages) and a set of article pages linked from TOC pages through informative links. The Hyperlink-Induced Topic Search (HITS) algorithm has been employed to analyze the authorities and hubs among pages. However, most content sites contain extra hyperlinks, such as navigation panels, advertisements, and banners, to increase the add-on value of their Web pages. Because of the structure induced by these extra hyperlinks, HITS does not achieve good precision on this problem. To remedy this, we develop an algorithm that applies entropy-based Link Analysis on Mining Web Informative Structures, referred to as LAMIS. The key idea of LAMIS is to use information entropy to represent the amount of information carried by a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and recall of LAMIS are much superior to those obtained by heuristic methods and conventional link analysis methods.
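The idea of weighting links by entropy before running a HITS-style iteration can be illustrated with a minimal Python sketch. This is not the LAMIS algorithm as published: the abstract does not give its formulas, so the weighting rule used here (one minus the normalized Shannon entropy of how an anchor term is spread across pages) and all names (`shannon_entropy`, `normalized_entropy`, `weighted_hits`, the toy site) are assumptions chosen for illustration only.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h


def normalized_entropy(counts):
    """Entropy scaled to [0, 1] by the maximum possible value log2(n)."""
    n = len(counts)
    if n <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log2(n)


def weighted_hits(links, weights, iterations=50):
    """HITS-style hub/authority iteration where each link carries a weight.

    links   : list of (source, target) pairs
    weights : dict mapping (source, target) to a non-negative weight
    """
    nodes = sorted({n for edge in links for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for s, d in links:
            new_auth[d] += weights[(s, d)] * hub[s]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub update: weighted sum of authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for s, d in links:
            new_hub[s] += weights[(s, d)] * auth[d]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Toy site (hypothetical): one TOC page, two articles, one navigation target.
links = [("toc", "a1"), ("toc", "a2"), ("toc", "nav"), ("a1", "nav"), ("a2", "nav")]

# Assumed weighting: a navigation anchor that occurs evenly on every page has
# maximal entropy and gets weight 0; an anchor seen on a single page gets 1.
nav_weight = 1.0 - normalized_entropy([1, 1, 1, 1])
article_weight = 1.0 - normalized_entropy([1, 0, 0, 0])
weights = {
    edge: (nav_weight if edge[1] == "nav" else article_weight) for edge in links
}

hub, auth = weighted_hits(links, weights)
print("hubs:", hub)
print("authorities:", auth)
```

In this toy graph the navigation target receives the most in-links, so unweighted HITS would tend to rank it as a strong authority; with the assumed entropy weights its links contribute nothing, and the TOC page and the two articles stand out instead.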
This publication has 12 references indexed in Scilit:
- Enhanced topic distillation using text, markup tags, and hyperlinks. Published by Association for Computing Machinery (ACM), 2001
- IEPAD. Published by Association for Computing Machinery (ACM), 2001
- Constructing multi-granular and topic-focused web site maps. Published by Association for Computing Machinery (ACM), 2001
- Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. Published by Association for Computing Machinery (ACM), 2001
- Does “authority” mean quality? Predicting expert quality ratings of Web documents. Published by Association for Computing Machinery (ACM), 2000
- Mining the Web's link structure. Computer, 1999
- Learning to remove Internet advertisements. Published by Association for Computing Machinery (ACM), 1999
- Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 1998
- Silk from a sow's ear. Published by Association for Computing Machinery (ACM), 1996
- A Mathematical Theory of Communication. Bell System Technical Journal, 1948