Extracting knowledge from the World Wide Web

6 April 2004

journal article
research article
Published in Proceedings of the National Academy of Sciences

Vol. 101 (suppl. 1), 5186-5191
https://doi.org/10.1073/pnas.0307528100

Abstract

The World Wide Web provides an unprecedented opportunity to automatically analyze a large sample of interests and activity in the world. We discuss methods for extracting knowledge from the web by randomly sampling and analyzing hosts and pages, and by analyzing the link structure of the web and how links accumulate over time. A variety of interesting and valuable information can be extracted, such as the distribution of web pages over domains, the distribution of interest in different areas, communities related to different topics, the nature of competition in different categories of sites, and the degree of communication between different communities or countries.
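
As a rough illustration of one quantity the abstract mentions, the distribution of web pages over domains, the following minimal sketch tallies a sample of URLs by top-level domain. It is not the authors' method: the function name `tld_distribution` and the sample URLs are hypothetical, and the sketch assumes a random sample of URLs has already been collected by some other means.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_distribution(urls):
    """Return the fraction of sampled URLs under each top-level domain."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if not host or "." not in host:
            continue  # skip malformed URLs and bare host names
        # Take the label after the last dot as the top-level domain.
        counts[host.rsplit(".", 1)[-1].lower()] += 1
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()} if total else {}


# Hypothetical sample; a real study would draw hosts or pages at random.
sample = [
    "http://example.com/a",
    "http://example.org/",
    "http://www.example.com/b",
    "http://example.edu/index.html",
]
dist = tld_distribution(sample)
```

The estimate is only as good as the sample: a sample biased toward well-linked pages would overstate the share of domains that attract many links.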
