Web page classification

23 February 2009

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Computing Surveys

Vol. 41 (2), 1-31
https://doi.org/10.1145/1459352.1459357

Abstract

Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

Funding Information

Division of Information and Intelligent Systems (IIS-0328825)
