Record-boundary discovery in Web documents

Extraction of information from unstructured or semistructured Web documents often requires recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of ...
