Learning block importance models for web pages
- 17 May 2004
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 203-211
- https://doi.org/10.1145/988672.988700
Abstract
Previous work shows that a web page can be partitioned into multiple segments or blocks, and that the blocks in a page are often not equally important. It has also been shown that separating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. However, no uniform approach or model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view about the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments in the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
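The feature-vector and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` record, its field names, the normalization by page size, and the helper names `block_features` and `micro_scores` are all assumptions made for illustration. The micro-averaged scores assume the common definition in which one-vs-rest decisions are pooled over all importance levels before computing F1 and accuracy. The abstract does not spell out its exact formulas.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical block record; field names are illustrative, not from the paper.
    x: float
    y: float
    width: float
    height: float
    num_images: int
    num_links: int


def block_features(block, page_width, page_height):
    """Spatial features scaled by page size (an assumption), then content counts."""
    return [
        block.x / page_width,
        block.y / page_height,
        block.width / page_width,
        block.height / page_height,
        float(block.num_images),
        float(block.num_links),
    ]


def micro_scores(y_true, y_pred, labels):
    """Pool one-vs-rest decisions over all labels, then compute (F1, accuracy)."""
    tp = fp = fn = tn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy
```

Under this pooled definition, every correctly rejected level counts as a true negative, so micro-accuracy can exceed micro-F1 on the same predictions.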
This publication has 9 references indexed in Scilit:
- Automatic browsing of large pictures on mobile devices. Published by Association for Computing Machinery (ACM), 2003
- Eliminating noisy information in Web pages for data mining. Published by Association for Computing Machinery (ACM), 2003
- Improving pseudo-relevance feedback in web information retrieval using web page segmentation. Published by Association for Computing Machinery (ACM), 2003
- DOM-based content extraction of HTML documents. Published by Association for Computing Machinery (ACM), 2003
- Discovering informative content blocks from Web documents. Published by Association for Computing Machinery (ACM), 2002
- Template detection via data mining and its applications. Published by Association for Computing Machinery (ACM), 2002
- Function-based object model towards website adaptation. Published by Association for Computing Machinery (ACM), 2001
- An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval Journal, 1999
- Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, 1965