ONDUX
- 6 June 2010
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 807-818
- https://doi.org/10.1145/1807167.1807254
Abstract
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores sequencing and positioning of attribute values directly learned on-demand from test data, with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
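To make the idea of matching input segments against pre-existing data more concrete, the following is a minimal toy sketch in Python. It is not the ONDUX algorithm: the knowledge base contents, the function names (`build_term_index`, `label_token`, `segment`), and the simple term-count matching rule are all assumptions chosen for illustration, and the abstract does not specify the actual matching functions used in the paper.

```python
from collections import Counter

# Hypothetical knowledge base: attribute -> example values from pre-existing data.
KNOWLEDGE_BASE = {
    "street": ["Main Street", "Oak Avenue", "Elm Street"],
    "city": ["Springfield", "Portland", "Salem"],
    "state": ["Oregon", "Illinois"],
}


def build_term_index(kb):
    """Count, per attribute, how often each lowercase term occurs in its values."""
    index = {}
    for attribute, values in kb.items():
        counts = Counter()
        for value in values:
            counts.update(value.lower().split())
        index[attribute] = counts
    return index


def label_token(token, index):
    """Return the attribute whose known values contain the token most often, or None."""
    term = token.lower()
    best_attr, best_count = None, 0
    for attribute, counts in index.items():
        if counts[term] > best_count:
            best_attr, best_count = attribute, counts[term]
    return best_attr


def segment(text, kb):
    """Label each token by matching, then merge adjacent tokens with the same label."""
    index = build_term_index(kb)
    segments = []
    for token in text.replace(",", " ").split():
        label = label_token(token, index)
        if segments and segments[-1][0] == label:
            segments[-1][1].append(token)  # extend the current segment
        else:
            segments.append((label, [token]))  # start a new segment
    return [(label, " ".join(tokens)) for label, tokens in segments]


result = segment("42 Elm Street, Salem Oregon", KNOWLEDGE_BASE)
```

In this toy run, `result` labels "Elm Street" as `street`, "Salem" as `city`, and "Oregon" as `state`, while "42" stays unlabeled because no known value contains it. Tokens left unlabeled or ambiguous by matching alone are the kind of case that the abstract's reinforcement step, based on the sequencing and positioning of attribute values, is described as addressing; that step is not implemented here.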