XWRAP: an XML-enabled wrapper construction system for Web information sources
- 7 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 611-621
- https://doi.org/10.1109/icde.2000.839475
Abstract
The paper describes the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that is implicit in the original Web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates the tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides a user-friendly interface program that allows wrapper developers to generate their wrapper code with a few mouse clicks. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated in the first phase with the XWRAP component library to construct an executable wrapper program for the given Web source. We report the initial experiments on the performance of the XWRAP code generation system and the wrapper programs generated by XWRAP.
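As a rough illustration of the two-phase idea described in the abstract, the Python sketch below keeps the source-specific knowledge in a declarative rule table and combines it with one small reusable component to produce a wrapper that emits XML. This is not XWRAP's actual rule language, component library, or generated code. The rule format, the regular-expression matching, the sample page, and the names RULES, extract_field, and build_wrapper are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Phase 1 (hypothetical output): source-specific knowledge captured as
# declarative extraction rules. XWRAP derives its rules interactively;
# this dictionary format and the regex patterns are assumptions.
RULES = {
    "root": "forecast",
    "fields": [
        {"tag": "city", "pattern": r"<h1>(.*?)</h1>"},
        {"tag": "temperature", "pattern": r"Temp:\s*<b>(.*?)</b>"},
    ],
}


def extract_field(page, pattern):
    """Source-independent component: first captured group, or None."""
    match = re.search(pattern, page, re.DOTALL)
    return match.group(1).strip() if match else None


def build_wrapper(rules):
    """Phase 2: combine the rules with the generic component into a wrapper."""
    def wrapper(page):
        root = ET.Element(rules["root"])
        for field in rules["fields"]:
            value = extract_field(page, field["pattern"])
            if value is not None:
                # Metadata implicit in the page becomes an explicit XML tag.
                ET.SubElement(root, field["tag"]).text = value
        return ET.tostring(root, encoding="unicode")
    return wrapper


# Made-up sample page standing in for a Web source.
PAGE = "<html><h1>Atlanta</h1><p>Temp: <b>72F</b></p></html>"

wrap_weather = build_wrapper(RULES)
xml_doc = wrap_weather(PAGE)
print(xml_doc)

# Content filtering then runs against the XML, not the original HTML.
print(ET.fromstring(xml_doc).findtext("temperature"))
```

Supporting a different source in this sketch means writing a new rule table only; extract_field and build_wrapper stay unchanged, which mirrors the separation of source-specific from repetitive tasks that the abstract describes.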