Record-boundary discovery in Web documents

1 June 1999

proceedings article
Published by Association for Computing Machinery (ACM)

Vol. 28 (2), 467-478
https://doi.org/10.1145/304182.304223

Abstract

Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of ...
