Extracting structured data from Web pages

9 June 2003

proceedings article
Published by Association for Computing Machinery (ACM)

pp. 337-348
https://doi.org/10.1145/872757.872799

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
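The input/output contract described in the abstract (a set of pages in, a deduced template and the encoded values out) can be illustrated with a toy sketch. This is not the algorithm from the paper, whose details are not given in this record. It swaps in a much simpler heuristic: treat the longest common subsequence of tokens across all pages as the template, and treat whatever falls between template tokens as values. The function names (`tokenize`, `lcs`, `deduce_template`, `extract_values`) and the sample pages are hypothetical and exist only for illustration.

```python
import re

def tokenize(page):
    # Split a page into HTML tags and whitespace-separated words.
    return re.findall(r"<[^>]+>|[^<\s]+", page)

def lcs(a, b):
    # Classic dynamic-programming longest common subsequence of two token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def deduce_template(pages):
    # Assumption: tokens shared, in order, by every page belong to the template.
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = lcs(template, tokens)
    return template

def extract_values(template, page):
    # Greedy left-to-right alignment: tokens that do not match the next
    # expected template token are collected as part of the current value.
    values, current, t = [], [], 0
    for tok in tokenize(page):
        if t < len(template) and tok == template[t]:
            if current:
                values.append(" ".join(current))
                current = []
            t += 1
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values

# Hypothetical template-generated pages (made-up data).
pages = [
    "<html><body><b>Title:</b> Database Systems <b>Author:</b> Jane Roe</body></html>",
    "<html><body><b>Title:</b> Web Mining <b>Author:</b> John Q Public</body></html>",
    "<html><body><b>Title:</b> Compilers <b>Author:</b> Ann Lee</body></html>",
]

template = deduce_template(pages)
records = [extract_values(template, p) for p in pages]
for record in records:
    print(record)
```

The heuristic breaks down in cases a real extractor has to handle: a value that happens to appear on every page is absorbed into the template, an empty field shifts the remaining columns, and repeated or optional page sections are not modeled at all.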
