Record-boundary discovery in Web documents
- 1 June 1999
- proceedings article
- Published by Association for Computing Machinery (ACM)
- Vol. 28 (2), 467-478
- https://doi.org/10.1145/304182.304223
Abstract
Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of ...
This publication has 6 references indexed in Scilit:
- Ontology-based extraction and structuring of information from data-rich unstructured documents. Published by Association for Computing Machinery (ACM), 1998
- NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents. Published by Association for Computing Machinery (ACM), 1998
- Wrapper generation for semi-structured Internet sources. ACM SIGMOD Record, 1997
- Virtual database technology. ACM SIGMOD Record, 1997
- Cut and paste. Published by Association for Computing Machinery (ACM), 1997
- A scalable comparison-shopping agent for the World-Wide Web. Published by Association for Computing Machinery (ACM), 1997